Skip to content

OCPEDGE-2498: Add migration support#83

Open
eggfoobar wants to merge 3 commits into
openshift-eng:mainfrom
eggfoobar:add-migration-support
Open

OCPEDGE-2498: Add migration support#83
eggfoobar wants to merge 3 commits into
openshift-eng:mainfrom
eggfoobar:add-migration-support

Conversation

@eggfoobar

@eggfoobar eggfoobar commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • New Features

    • Added a guided workflow to transition an OpenShift cluster from single-node to 3-node, including provisioning, bootstrapping, DNS updates, and verification steps.
    • Added new deployment commands for starting the transition and cleaning up the resulting topology changes.
  • Bug Fixes

    • Improved safety checks and interactive confirmations before making destructive cluster and VM changes.
    • Added automated recovery for certain rollout drain issues and clearer validation of cluster state during the transition.

@openshift-ci-robot

openshift-ci-robot commented Jun 26, 2026

Copy link
Copy Markdown

@eggfoobar: This pull request references OCPEDGE-2498 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 26, 2026
@openshift-ci openshift-ci Bot requested review from jerpeter1 and qJkee June 26, 2026 06:20
@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eggfoobar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2026
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR adds a complete Ansible-based automation workflow to transition a single-node OpenShift (SNO) cluster to a 3-node control plane, plus reverse cleanup. It includes new Makefile targets, shell entrypoint scripts, top-level playbooks, and a mutable-topology/sno-to-3node role with defaults, vars, ignition/VM provisioning tasks, topology/MCO transition logic, DNS updates, and verification.

Changes

SNO to 3-node transition automation

Layer / File(s) Summary
Entrypoints and Makefile targets
deploy/Makefile, deploy/openshift-clusters/scripts/sno-to-3node.sh, deploy/openshift-clusters/scripts/clean-mutable-topology.sh
New make targets and strict-mode bash scripts validate prerequisites and invoke the corresponding Ansible playbooks non-interactively.
Top-level playbooks
deploy/openshift-clusters/sno-to-3node.yml, deploy/openshift-clusters/clean-mutable-topology.yml
Playbooks targeting metal_machine pause for interactive confirmation, detect cluster domain, guard against unsafe re-runs (HighlyAvailable check), and invoke the role or its clean.yml tasks.
Role defaults and variables
.../defaults/main.yml, .../vars/main.yml
Defines cluster identity, master IPs, VM sizing, libvirt network settings, timeouts, and computed vars (kubeconfig path, dnsmasq paths, new node list).
Preflight validation
.../tasks/main.yml, .../tasks/preflight.yml
Wires nine ordered phases and validates platform/topology, node readiness, etcd/CO health, and remediates api-int DNS.
Ignition generation
.../tasks/ignition.yml, .../templates/master.ign.j2, .../templates/auto-install.ign.j2
Generates master.ign from the MCS CA cert and templates auto-install.ign for embedding into install ISOs.
VM provisioning and DHCP/DNS registration
.../tasks/create-vms.yml
Cleans stale state, generates MACs, creates disks, defines VMs via virt-install, registers DHCP/DNS, and verifies resolution.
RHCOS ISO boot and install
.../tasks/boot-nodes.yml
Fetches/embeds RHCOS ISOs, boots VMs for install, polls for completion, removes CDROM boot entry, and restarts from disk.
Node join and etcd waits
.../tasks/wait-nodes.yml, .../tasks/wait-etcd.yml
Approves CSRs, polls node readiness, then polls etcd readiness and prints diagnostics.
Topology patch and MCO rollout
.../tasks/topology.yml, .../tasks/mco-rollout.yml
Scales down CVO, patches topology to HighlyAvailable, and monitors MCO rollout with auto-fix for stuck drains.
DNS update and verification
.../tasks/update-dns.yml, .../tasks/verify.yml
Regenerates dnsmasq entries, verifies API/apps DNS, and asserts final node/topology/etcd success criteria.
Cleanup workflow
.../tasks/clean.yml
Destroys new VMs/disks/ISOs, removes DHCP/DNS/addnhosts entries, cleans lease file, and reloads dnsmasq/NetworkManager.

Estimated code review effort: 4 (Complex) | ~75 minutes

Possibly related PRs

  • openshift-eng/two-node-toolbox#72: The new entrypoint scripts validate aws-instance-id via get_node_dir from aws-hypervisor/scripts/common.sh, reworked in this related PR.

Suggested reviewers: fonta-rh, slintes, jaypoulz


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error New debug/echo output logs cluster domain, infra ID, node IPs/names, CSR names, and etcd/CO details. Remove or redact sensitive fields from logs (cluster domain, hostnames, IPs, CSR names, infra IDs) and keep only coarse status messages.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Weak-Crypto ✅ Passed Checked the added deploy/openshift-clusters files; no weak hashes/ciphers, custom crypto, or secret/token comparisons were found.
Container-Privileges ✅ Passed The PR adds Ansible playbooks/scripts, not container manifests, and no changed files contain privileged/hostPID/hostNetwork/hostIPC/SYS_ADMIN/allowPrivilegeEscalation settings.
No-Hardcoded-Secrets ✅ Passed No hardcoded secrets found in the new SNO-to-3node playbooks/scripts; only paths and templated vars appear, not literal keys/tokens/passwords or embedded-credential URLs.
No-Injection-Vectors ✅ Passed No eval/exec/pickle/yaml.load/os.system/innerHTML sinks found; new shell usage is Ansible/bash with hardcoded or derived vars, not direct untrusted input.
Ai-Attribution ✅ Passed No AI tool use is mentioned in the commit/PR context; HEAD has only Signed-off-by and no Co-Authored-By, Assisted-by, or Generated-by trailers.
Title check ✅ Passed The title is related to the change set and correctly signals new migration support, though it is broader than the specific SNO-to-3-node work.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
deploy/Makefile (1)

122-130: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Document the new sno-ipi and sno-agent targets in make help.

Lines 79-83 add both commands, but this help block only advertises sno-to-3node. make help will miss the new entrypoints entirely.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/Makefile` around lines 122 - 130, Update the Makefile help output so
the new `sno-ipi` and `sno-agent` targets are listed alongside the existing
deployment entries in `make help`. Add matching `@echo` lines in the help block
near the `sno-to-3node` entry, using the same wording/style as the other target
descriptions so the new entrypoints are discoverable.
deploy/openshift-clusters/scripts/deploy-cluster.sh (1)

62-69: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject the unsupported sno + kcli combination explicitly.

This validation now allows --topology sno for every method, so a direct deploy-cluster.sh --topology sno --method kcli falls through to kcli-install.yml. The PR only adds SNO support through the install-dev/setup flow, so this opens an unsupported path instead of failing fast.

Suggested fix
 if [[ "${METHOD}" != "ipi" && "${METHOD}" != "agent" && "${METHOD}" != "kcli" ]]; then
     echo "Error: Invalid method '${METHOD}'. Must be 'ipi', 'agent', or 'kcli'."
     exit 1
 fi
+
+if [[ "${TOPOLOGY}" == "sno" && "${METHOD}" == "kcli" ]]; then
+    echo "Error: Topology 'sno' only supports 'ipi' or 'agent'."
+    exit 1
+fi

Also applies to: 87-104

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/deploy-cluster.sh` around lines 62 - 69,
The validation in deploy-cluster.sh currently permits every METHOD with
TOPOLOGY=sno, which allows an unsupported sno + kcli path to continue into the
kcli install flow. Update the argument checks around the TOPOLOGY/METHOD
validation blocks to explicitly reject the unsupported combination before any
install logic runs, and keep the failure consistent with the existing validation
messages. Reference the validation section and the kcli-install path so the
guard is placed where the topology and method are first checked.
deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml (1)

26-32: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Include the SNO scenario in the missing-variable error.

Line 31 still only tells agent users to add TNF_IPV4 or TNA_IPV4. Now that sno is a supported prefix on Line 43, a missing AGENT_E2E_TEST_SCENARIO for SNO will produce the wrong remediation.

Proposed fix
-      AGENT_E2E_TEST_SCENARIO="TNF_IPV4" (fencing) or "TNA_IPV4" (arbiter) to
-      the config file.
+      AGENT_E2E_TEST_SCENARIO="TNF_IPV4" (fencing), "TNA_IPV4" (arbiter), or
+      "SNO_IPV4" (sno) to the config file.

Also applies to: 43-43

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml`
around lines 26 - 32, The missing-AGENT_E2E_TEST_SCENARIO failure message in the
config validation task should be updated to include the SNO option now supported
by the method-prefix logic. Adjust the fail message in the config validation
flow so it lists the correct remediation for all supported prefixes, including
the new sno case alongside the existing agent scenarios, and make sure the
guidance matches the prefix handling used in the task that selects by method.
🧹 Nitpick comments (2)
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the repository-standard Bash shebang.

This new wrapper still uses #!/bin/bash, but the repo shell-script rule requires #!/usr/bin/bash. As per coding guidelines, "Shell scripts should use #!/usr/bin/bash shebang".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` around lines 1 - 2, The
wrapper script currently uses the non-standard Bash shebang, so update the
script header to match the repository convention by changing the shebang at the
top of sno-to-3node.sh to the repo-standard Bash path. Keep the rest of the
script unchanged, and verify the file still starts with the expected shebang
followed by set -euo pipefail.

Source: Coding guidelines

deploy/openshift-clusters/scripts/clean-mutable-topology.sh (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the repository-standard Bash shebang.

This new wrapper still uses #!/bin/bash, but the repo shell-script rule requires #!/usr/bin/bash. As per coding guidelines, "Shell scripts should use #!/usr/bin/bash shebang".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 -
2, The clean-mutable-topology.sh wrapper uses the wrong Bash shebang, so update
the script header to match the repository standard. Replace the current shebang
at the top of the file with the repo-required Bash interpreter line, keeping the
rest of the script unchanged; this is the only fix needed in the script’s
startup header.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/openshift-clusters/clean-mutable-topology.yml`:
- Around line 35-39: The cleanup playbook currently imports
mutable-topology/sno-to-3node tasks/clean.yml unconditionally, which can destroy
a healthy HA control plane. Add a topology check in clean-mutable-topology.yml
before the import_role so it only runs when the cluster is not already HA, or
require an explicit force_cleanup=true override to proceed. Use the existing
import_role task as the entry point and gate it with a condition or precheck
that prevents destructive cleanup unless the override is set.

In
`@deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh`:
- Around line 35-40: This example hardcodes an x86_64 release image and leaves
out the ARM64-specific Metal3 overrides, so update the config example to
explicitly document how to switch to aarch64/Graviton. Use
OPENSHIFT_RELEASE_IMAGE as the payload architecture selector and add the
required Metal3 container override guidance (IRONIC_IMAGE, VBMC_IMAGE,
SUSHY_TOOLS_IMAGE) in the example so the install path is correct for ARM64
hypervisors. Refer to the config_sno_example.sh example block and keep the
existing installer/image-policy comments consistent with the new ARM64 notes.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 68-72: The cleanup task in clean.yml is deleting the original SNO
node’s api-int DNS entry, which can break the surviving cluster after cleanup.
Update the [clean] Remove api-int from libvirt network DNS task to target only
mutable-topology-added resources associated with sno_new_nodes, and avoid
removing the api-int record for sno_master0_ip in the libvirt network update
logic.
- Around line 37-43: The embedded Python in the cleanup task scripts is
indented, which causes the inline `python3 -c` snippets used in `clean.yml` to
fail before the DHCP cleanup logic runs. Update the scripts in the task blocks
that use `EXISTING_MAC` and the lease cleanup command so the Python content is
left-aligned or otherwise dedented, and keep the logic under the same task names
so `|| true` / `failed_when: false` no longer hides a parse failure.
- Around line 26-30: The per-node ISO cleanup in the [clean] task is using a
hostname-based filename that can collide across clusters on the same hypervisor.
Update the ISO naming scheme in the related ISO creation and cleanup flow to use
a cluster-scoped identifier, such as item.name or a sno_cluster_name-prefixed
name, and make the file path in the cleanup task match that same unique pattern.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`:
- Around line 27-32: The DNS cleanup in the `create-vms.yml` task is still
targeting `{{ item.ip }}`, so stale `dns-host` records for `master-1`/`master-2`
with old IPs are not removed before `virsh net-update add dns-host` runs. Update
the cleanup logic around the `virsh net-update` calls to delete entries by
hostname instead of by IP, and then tighten the `dig` verification so it checks
that the resolved answers actually contain `{{ item.ip }}` rather than only a
non-empty response.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml`:
- Around line 19-24: The MCS reachability check in the ignition task is
currently allowed to fail and continue, which defers a required configuration
error until boot. Update the existing "[ignition] Verify MCS reachability" step
so it aborts when the curl probe to the MCS health endpoint fails, and only keep
a non-failing path if you explicitly add a break-glass override. Use the
existing mcs_health probe logic as the place to enforce fail-fast behavior.
- Around line 13-18: The ignition generation step writes /tmp/master.ign but
never populates sno_master_ign_b64, so auto-install.ign.j2 cannot render
reliably. Update the relevant task flow in ignition.yml to read back the
generated master.ign and set sno_master_ign_b64 before the template that
consumes it, using the existing ignition generation and auto-install rendering
steps as the anchor points. Ensure the variable is produced within the role
rather than assuming an external caller supplies it.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml`:
- Around line 68-81: The MCO rollout polling in the mco-rollout task currently
suppresses timeout failures by using failed_when: false, which allows the play
to continue even if mcp_poll never reports completion. Update the
polling/verification logic in the mcp_poll task so a timeout is fatal by
default, or make the non-fatal continuation explicitly opt-in via a best-effort
mode. Keep the success check based on mcp_poll.stdout and ensure the final MCO
rollout result step reflects failure when the rollout does not complete.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 80-120: The preflight checks in the etcd probe and the
cluster-operator health check are only logging warnings instead of stopping the
topology transition, which makes this phase non-gating. Update the preflight
logic so the tasks for the etcd pod check and the degraded CO check fail the
play by default, removing the unconditional `|| true` and the `failed_when:
false` behavior; keep any bypass only behind an explicit break-glass override.
Use the existing task names like "[preflight] Check etcd pods" and "[preflight]
Check for degraded cluster operators" to locate the affected steps.
- Around line 129-154: The preflight repair flow for api-int only deletes
entries that already match sno_master0_ip and the verification step only checks
for any DNS answer, so stale mappings can remain undetected. Update the shell
logic in the api-int repair task to remove existing api-int entries by
hostname/FQDN regardless of IP, then make the dns_apiint_verify check confirm
the resolved result includes sno_master0_ip rather than just being non-empty.
Use the existing preflight tasks and variables like sno_libvirt_network,
sno_master0_ip, and sno_cluster_domain to locate and adjust the repair and
verification steps.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml`:
- Around line 24-31: The dnsmasq config in the update-dns task is incorrectly
pinning the wildcard apps record to sno_all_node_ips[0], which can vary because
node ordering is not stable. Update the logic in the sno_dnsmasq_lines
generation to use a stable ingress target such as a VIP/load balancer, or
otherwise emit all intended ingress node IPs instead of selecting the first node
IP. Make the change in the set_fact block that builds the DNS entries so the
*.apps record does not depend on array order.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml`:
- Around line 28-48: The wait-etcd task only checks for 3 Ready pods, while the
etcd membership and endpoint health probes in the same flow still suppress
failures via failed_when: false and fallback echo text. Update the wait logic in
wait-etcd.yml so the shell checks using oc exec and etcdctl member list /
endpoint health are allowed to fail the task, or fold their success into the
retry condition, so the role does not proceed to topology.yml until etcd quorum
and health are actually good.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml`:
- Around line 4-21: The CSR approval logic in the wait-nodes task is too broad
because it approves every pending request returned by oc get csr. Update the
pending-CSR collection and approval loop in the shell block so it only selects
kubelet client/server CSRs associated with sno_new_nodes, and then approve only
those entries before calling oc adm certificate approve. Use the existing
PENDING/CSR handling in wait-nodes.yml as the place to narrow the filter.

In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml`:
- Around line 2-5: The kubeconfig path in sno_kubeconfig_resolved is incorrectly
derived from the default sno_cluster_name, so it can point to the wrong cluster
before discovery happens. Update the mutable-topology SNO-to-3node vars so
sno_cluster_name is not assumed here; instead require sno_cluster_name or
sno_kubeconfig to be provided up front, or change the resolution logic to use a
source that does not depend on the default value. Keep the fix localized to the
sno_kubeconfig_resolved expression and related variable handling.

---

Outside diff comments:
In `@deploy/Makefile`:
- Around line 122-130: Update the Makefile help output so the new `sno-ipi` and
`sno-agent` targets are listed alongside the existing deployment entries in
`make help`. Add matching `@echo` lines in the help block near the
`sno-to-3node` entry, using the same wording/style as the other target
descriptions so the new entrypoints are discoverable.

In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml`:
- Around line 26-32: The missing-AGENT_E2E_TEST_SCENARIO failure message in the
config validation task should be updated to include the SNO option now supported
by the method-prefix logic. Adjust the fail message in the config validation
flow so it lists the correct remediation for all supported prefixes, including
the new sno case alongside the existing agent scenarios, and make sure the
guidance matches the prefix handling used in the task that selects by method.

In `@deploy/openshift-clusters/scripts/deploy-cluster.sh`:
- Around line 62-69: The validation in deploy-cluster.sh currently permits every
METHOD with TOPOLOGY=sno, which allows an unsupported sno + kcli path to
continue into the kcli install flow. Update the argument checks around the
TOPOLOGY/METHOD validation blocks to explicitly reject the unsupported
combination before any install logic runs, and keep the failure consistent with
the existing validation messages. Reference the validation section and the
kcli-install path so the guard is placed where the topology and method are first
checked.

---

Nitpick comments:
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-2: The clean-mutable-topology.sh wrapper uses the wrong Bash
shebang, so update the script header to match the repository standard. Replace
the current shebang at the top of the file with the repo-required Bash
interpreter line, keeping the rest of the script unchanged; this is the only fix
needed in the script’s startup header.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Around line 1-2: The wrapper script currently uses the non-standard Bash
shebang, so update the script header to match the repository convention by
changing the shebang at the top of sno-to-3node.sh to the repo-standard Bash
path. Keep the rest of the script unchanged, and verify the file still starts
with the expected shebang followed by set -euo pipefail.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d1bb47f1-0e94-4c52-bf4c-ea4c9efe1020

📥 Commits

Reviewing files that changed from the base of the PR and between 903e071 and 497dfcc.

📒 Files selected for processing (27)
  • deploy/Makefile
  • deploy/openshift-clusters/clean-mutable-topology.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/files/.gitignore
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
  • deploy/openshift-clusters/scripts/clean-mutable-topology.sh
  • deploy/openshift-clusters/scripts/deploy-cluster.sh
  • deploy/openshift-clusters/scripts/sno-to-3node.sh
  • deploy/openshift-clusters/sno-to-3node.yml

Comment on lines +35 to +39
tasks:
- name: Run mutable topology cleanup
import_role:
name: mutable-topology/sno-to-3node
tasks_from: clean.yml

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Refuse cleanup once the cluster is already HA unless the caller forces it.

This imports tasks/clean.yml unconditionally, and that role deletes master-1/master-2 plus their disks and DHCP state. On a successful 3-node cluster, removing two control-plane nodes here can drop etcd from 3 members to 1 and take the cluster down. Please gate this on the current control-plane topology, or require an explicit force_cleanup=true override before running the destructive role.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 35 - 39,
The cleanup playbook currently imports mutable-topology/sno-to-3node
tasks/clean.yml unconditionally, which can destroy a healthy HA control plane.
Add a topology check in clean-mutable-topology.yml before the import_role so it
only runs when the cluster is not already HA, or require an explicit
force_cleanup=true override to proceed. Use the existing import_role task as the
entry point and gate it with a condition or precheck that prevents destructive
cleanup unless the override is set.

Comment on lines +129 to +154
- name: "[preflight] Add api-int to libvirt network DNS"
# Using virsh net-update (not lineinfile) so that api-int survives addnhosts
# regeneration when create-vms.yml calls virsh net-update for master-1/2.
# libvirt regenerates ostestbm.addnhosts from the network XML on every net-update.
shell: |
# Remove any stale api-int entry (idempotent, ignore errors)
sudo virsh net-update {{ sno_libvirt_network }} delete dns-host \
"<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \
--live --config 2>/dev/null || true
# Add the correct entry - triggers addnhosts regeneration with api-int included
sudo virsh net-update {{ sno_libvirt_network }} add dns-host \
"<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \
--live --config
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout
changed_when: true

- name: "[preflight] Verify api-int DNS after fix"
shell: |
dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1
register: dns_apiint_verify
changed_when: false
failed_when: dns_apiint_verify.stdout | trim == ""
retries: 3
delay: 2
until: dns_apiint_verify.stdout | trim != ""
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

The api-int repair path does not remove or verify stale mappings correctly.

The delete call matches api-int on {{ sno_master0_ip }}, so it only removes a record that already points at the desired IP. If api-int is stale and points somewhere else, the old XML entry survives, and the follow-up dig still passes because it only checks for any answer, not for {{ sno_master0_ip }}. Remove existing api-int entries by hostname/FQDN and verify the resolved answers include the expected IP.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`
around lines 129 - 154, The preflight repair flow for api-int only deletes
entries that already match sno_master0_ip and the verification step only checks
for any DNS answer, so stale mappings can remain undetected. Update the shell
logic in the api-int repair task to remove existing api-int entries by
hostname/FQDN regardless of IP, then make the dns_apiint_verify check confirm
the resolved result includes sno_master0_ip rather than just being non-empty.
Use the existing preflight tasks and variables like sno_libvirt_network,
sno_master0_ip, and sno_cluster_domain to locate and adjust the repair and
verification steps.

@eggfoobar eggfoobar force-pushed the add-migration-support branch from 497dfcc to 134e627 Compare June 30, 2026 04:32
@coderabbitai coderabbitai Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 30, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deploy/Makefile (1)

135-144: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add the new SNO deploy targets to make help.

make help documents the migration and cleanup entrypoints, but sno-ipi and sno-agent are still missing from the deployment list, so the new deploy flow is easy to miss.

Suggested patch
 	`@echo` "  fencing-assisted     - Deploy hub + spoke TNF cluster via assisted installer"
+	`@echo` "  sno-ipi              - Deploy SNO IPI cluster (non-interactive)"
+	`@echo` "  sno-agent            - Deploy SNO Agent cluster (non-interactive)"
 	`@echo` "  sno-to-3node         - Transition existing SNO cluster to 3-node HA (platform:none)"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/Makefile` around lines 135 - 144, The help output in Makefile is
missing the new SNO deployment entrypoints, so update the `make help` target’s
OpenShift Cluster Management section to include `sno-ipi` and `sno-agent`
alongside the existing deploy/cleanup commands. Keep the new entries grouped
with the other deployment targets and use the same `@echo` pattern in the `help`
recipe so the new flow is discoverable.
🧹 Nitpick comments (3)
deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh (1)

1-41: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make this file clearly a config fragment or a real script.

It currently has a Bash shebang, but it does not follow the repo’s shell-script contract (#!/usr/bin/bash and set -euo pipefail). Either drop the shebang and document it as a sourced config example, or make it conform to the script standard. As per coding guidelines, "Shell scripts should use #!/usr/bin/bash shebang" and "Include error handling with set -euo pipefail".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh`
around lines 1 - 41, The file is currently ambiguous between a sourced config
fragment and an executable shell script, and its header does not match the repo
shell-script contract. Either remove the shebang and clearly treat it as a
sourced config example, or convert it into a proper script by updating the
header in config_sno_example.sh to use the required shell standard and add
strict error handling near the top. Make the choice consistent with how this
file is used, and keep the existing export-based config values under that
contract.

Source: Coding guidelines

deploy/openshift-clusters/scripts/sno-to-3node.sh (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the repository Bash shebang.

This wrapper already follows the repo's strict-mode pattern, but the shebang should be #!/usr/bin/bash to match the shell-script standard.

Suggested patch
-#!/bin/bash
+#!/usr/bin/bash

As per coding guidelines, "Shell scripts should use #!/usr/bin/bash shebang".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` around lines 1 - 2, The
shell script header in sno-to-3node.sh should use the repository-standard Bash
interpreter path instead of the generic Bash shebang. Update the script’s
shebang at the top of the file to match the shell-script convention used across
the repo, while keeping the existing strict-mode settings unchanged.

Source: Coding guidelines

deploy/openshift-clusters/scripts/clean-mutable-topology.sh (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the repository Bash shebang.

This wrapper already follows the repo's strict-mode pattern, but the shebang should be #!/usr/bin/bash to match the shell-script standard.

Suggested patch
-#!/bin/bash
+#!/usr/bin/bash

As per coding guidelines, "Shell scripts should use #!/usr/bin/bash shebang".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 -
2, The script uses the wrong Bash shebang at the top of
clean-mutable-topology.sh; update the shebang to the repository-standard
/usr/bin/bash while keeping the existing strict-mode settings intact. Make this
change at the script header so the wrapper matches the repo shell-script
convention used alongside set -euo pipefail.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml`:
- Around line 15-19: The `oc project` retry block in `main.yml` is swallowing
failures because `failed_when: false` lets the handler succeed even after all
retries are exhausted. Update the handler logic around `oc_project_result` so
that a nonzero `rc` after the final retry actually fails the task, or at minimum
emits an explicit warning/error on the last attempt; keep the retry behavior
intact but make the terminal failure visible.

In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml`:
- Line 43: The missing-scenario remediation for agent methods is incomplete
because the guidance still only mentions TNF_IPV4 and TNA_IPV4 even though sno
is now supported; update the failure message in the agent method validation
logic to include SNO and point users to SNO_IPV4 as well. Make the change in the
same place that checks AGENT_E2E_TEST_SCENARIO so the message stays aligned with
the scenario mapping used by sno.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`:
- Around line 57-58: The cleanup in boot-nodes is leaving UEFI guest definitions
behind because the `virsh undefine` call in the VM teardown sequence does not
remove the OVMF NVRAM varstore. Update the `virsh undefine` usage in the
`boot-nodes` task so it includes `--nvram` for these guests, keeping the
existing `virsh destroy`/cleanup flow in place and ensuring the old domain is
fully removed before the next `virt-install` runs.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 114-124: The cleanup in the `clean.yml` task only reloads libvirt
dnsmasq, but it leaves the NetworkManager dnsmasq override created by
`update-dns.yml` in place. Update the `[clean]` flow to also remove `{{
sno_nm_dnsmasq_conf }}` and then reload or restart `NetworkManager` so stale
local DNS entries for `api`, `api-int`, and `*.apps` are cleared. Use the
existing cleanup block near `[clean] Reload libvirt dnsmasq` and `[clean]
Cleanup complete` to add this DNS teardown step.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`:
- Around line 130-137: The "[create-vms] Add DNS entries to libvirt network"
task currently hides failures by setting failed_when: false on the virsh
net-update add dns-host call. Remove the failure masking so the task in
create-vms.yml fails when virsh net-update cannot update dns-host, and keep the
existing loop over sno_new_nodes so each hostname update is validated. If a
fallback is needed, handle it explicitly in a separate step rather than allowing
this task to report success when the libvirt DNS update did not apply.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml`:
- Around line 2-7: The `ignition.yml` probes are unbounded and can hang the role
during API or DNS stalls. Update the `"[ignition] Extract MCS CA certificate"`
task and the later `curl` probe task to use explicit timeouts for the `oc` and
`curl` commands, keeping the existing task structure and symbols like
`mcs_ca_raw` and the probe step intact. Ensure both commands fail quickly
instead of waiting indefinitely.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml`:
- Around line 42-64: The drain auto-fix currently applies to every cordoned node
in the mco-rollout task, which can bypass normal MCO drains or intentional
maintenance cordons. Update the logic around CORDONED_NODES so it only runs when
there is a stale MCO-drain signal, or change sno_auto_fix_drain to be opt-in by
default instead of default(true). Keep the uncordon/lastAppliedDrain/MCC restart
flow behind that stricter gate so normal SchedulingDisabled nodes are not
touched.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 135-143: The embedded Python passed to python3 -c in the preflight
task is not indented under the shell block, so the YAML parser can misread it as
new keys and fail to load the file. Fix the shell: | content in preflight.yml by
indenting the full Python snippet consistently beneath the CURRENT_IP assignment
while keeping the same logic in the task that queries virsh net-dumpxml for
api-int.{{ sno_cluster_domain }}.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml`:
- Around line 2-5: The CVO disable step in the topology transition flow is not
protected by a restore path, so a failure can leave the cluster without a
running CVO. Update the SNO-to-3node topology task sequence around the
scale-down step to use a block with an always section that scales
cluster-version-operator back up, or add an explicit restore task after the
rollout phases if CVO must remain down temporarily. Make sure the fix is tied to
the existing topology task(s) that perform the CVO scale-down so the recovery
runs even when later phases fail.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml`:
- Around line 57-69: The DNS verification in update-dns.yml is too weak because
the API check only tests for any non-empty result and the apps check never
fails, so phase 8 can pass with stale or partial records. Tighten the validation
in the existing dns verification tasks by comparing the dig results against
sno_all_node_ips[0] for api and sno_all_node_ips for test.apps, and make the
checks fail/retry until the returned records match instead of accepting any
response.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`:
- Around line 91-111: The final verification in the [verify] task only checks
the number of node rows and can pass even if some nodes are NotReady. Update the
existing node verification around verify_nodes so it explicitly asserts all
three nodes are Ready before the migration success message is emitted, and keep
the HA/etcd checks as-is. Use the verify_nodes assertion block and the "[verify]
SNO to 3-node transition verified successfully" debug step to locate where to
add the readiness condition.

In `@deploy/openshift-clusters/scripts/deploy-cluster.sh`:
- Around line 62-63: The topology validation in deploy-cluster.sh currently
allows sno unconditionally, which lets unsupported kcli SNO requests pass into
the playbook and fail later. Update the TOPOLOGY check in the deploy-cluster.sh
flow so sno is only accepted for supported deployment methods, using the
existing method selection logic in the script, or alternatively align
kcli-install/tasks/validate.yml so kcli explicitly supports sno. Keep the fix
centered around the topology validation gate and the kcli validation path.

---

Outside diff comments:
In `@deploy/Makefile`:
- Around line 135-144: The help output in Makefile is missing the new SNO
deployment entrypoints, so update the `make help` target’s OpenShift Cluster
Management section to include `sno-ipi` and `sno-agent` alongside the existing
deploy/cleanup commands. Keep the new entries grouped with the other deployment
targets and use the same `@echo` pattern in the `help` recipe so the new flow is
discoverable.

---

Nitpick comments:
In
`@deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh`:
- Around line 1-41: The file is currently ambiguous between a sourced config
fragment and an executable shell script, and its header does not match the repo
shell-script contract. Either remove the shebang and clearly treat it as a
sourced config example, or convert it into a proper script by updating the
header in config_sno_example.sh to use the required shell standard and add
strict error handling near the top. Make the choice consistent with how this
file is used, and keep the existing export-based config values under that
contract.

In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-2: The script uses the wrong Bash shebang at the top of
clean-mutable-topology.sh; update the shebang to the repository-standard
/usr/bin/bash while keeping the existing strict-mode settings intact. Make this
change at the script header so the wrapper matches the repo shell-script
convention used alongside set -euo pipefail.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Around line 1-2: The shell script header in sno-to-3node.sh should use the
repository-standard Bash interpreter path instead of the generic Bash shebang.
Update the script’s shebang at the top of the file to match the shell-script
convention used across the repo, while keeping the existing strict-mode settings
unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2df4a7c9-40e6-4c41-b37a-324e24187240

📥 Commits

Reviewing files that changed from the base of the PR and between 52f92a5 and 134e627.

📒 Files selected for processing (27)
  • deploy/Makefile
  • deploy/openshift-clusters/clean-mutable-topology.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/files/.gitignore
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
  • deploy/openshift-clusters/scripts/clean-mutable-topology.sh
  • deploy/openshift-clusters/scripts/deploy-cluster.sh
  • deploy/openshift-clusters/scripts/sno-to-3node.sh
  • deploy/openshift-clusters/sno-to-3node.yml

Comment on lines +15 to +19
register: oc_project_result
retries: 5
delay: 15
until: oc_project_result.rc == 0
failed_when: false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don’t swallow an exhausted oc project failure.

After all retries are used, failed_when: false still lets this handler pass, so a real kubeconfig/auth/API failure becomes silent and the role continues without ever selecting the target project. Fail when oc_project_result.rc != 0, or emit an explicit warning/error on the terminal retry.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml`
around lines 15 - 19, The `oc project` retry block in `main.yml` is swallowing
failures because `failed_when: false` lets the handler succeed even after all
retries are exhausted. Update the handler logic around `oc_project_result` so
that a nonzero `rc` after the final retry actually fails the task, or at minimum
emits an explicit warning/error on the last attempt; keep the retry behavior
intact but make the terminal failure visible.

expected_prefix:
arbiter: "TNA"
fencing: "TNF"
sno: "SNO"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Update the missing-scenario remediation for sno too.

Adding sno here is incomplete: the earlier Fail if agent method is used without AGENT_E2E_TEST_SCENARIO message still only tells users to set TNF_IPV4 or TNA_IPV4. agent + sno will now fail with stale guidance instead of pointing to SNO_IPV4.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml` at
line 43, The missing-scenario remediation for agent methods is incomplete
because the guidance still only mentions TNF_IPV4 and TNA_IPV4 even though sno
is now supported; update the failure message in the agent method validation
logic to include SNO and point users to SNO_IPV4 as well. Make the change in the
same place that checks AGENT_E2E_TEST_SCENARIO so the message stays aligned with
the scenario mapping used by sno.

Comment on lines +57 to +58
sudo virsh destroy "$VM_NAME" 2>/dev/null || true
sudo virsh undefine "$VM_NAME" 2>/dev/null || true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== boot-nodes.yml ==\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml | sed -n '1,140p'

printf '\n== create-vms.yml ==\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml | sed -n '1,220p'

Repository: openshift-eng/two-node-toolbox

Length of output: 230


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== libvirt semantics search ==\n'
python3 - <<'PY'
import urllib.request, urllib.parse, re
query = urllib.parse.quote('libvirt virsh undefine --nvram OVMF NVRAM semantics')
url = f'https://duckduckgo.com/html/?q={query}'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req, timeout=20) as resp:
    html = resp.read().decode('utf-8', 'ignore')
for pat in [r'--nvram', r'undefine', r'NVRAM', r'OVMF']:
    print(f'[{pat}]', 'found' if re.search(pat, html, re.I) else 'not found')
print(html[:2000])
PY

Repository: openshift-eng/two-node-toolbox

Length of output: 2794


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== boot-nodes.yml ==\n'
sed -n '1,140p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml

printf '\n== create-vms.yml ==\n'
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml

printf '\n== repo virsh undefine search ==\n'
grep -RIn --exclude-dir=.git -E 'virsh undefine|--nvram' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node | sed -n '1,120p'

Repository: openshift-eng/two-node-toolbox

Length of output: 11126


🌐 Web query:

libvirt virsh undefine --nvram OVMF NVRAM documentation

💡 Result:

When using the virsh undefine command on a domain that utilizes UEFI/NVRAM, libvirt requires explicit instruction on how to handle the associated NVRAM file [1][2]. If a domain has an NVRAM file defined in its configuration and you attempt to undefine it without providing either the --nvram or --keep-nvram flag, the operation will fail with an error [1][3][4]. The relevant flags are: --nvram: Explicitly instructs libvirt to remove the associated NVRAM (varstore) file when the domain is undefined [5][2]. --keep-nvram: Explicitly instructs libvirt to retain the NVRAM file after the domain has been undefined [6][2]. This requirement was implemented to prevent accidental deletion of persistent NVRAM data (such as UEFI variables) [1][7]. If the domain is inactive, the undefine command removes its persistent configuration [8][2]. If the domain is active (running), it becomes transient—meaning it will continue to run until it is shut down, at which point it will be removed [8][2]. In both cases, the choice to delete or preserve the NVRAM file must be specified using these flags [6].

Citations:


virsh undefine needs --nvram for these UEFI guests. virt-install defines each VM with an OVMF NVRAM varstore, so this cleanup can fail and leave the old domain behind; || true then hides the error until the next virt-install hits “domain already exists.”

Suggested fix
-    sudo virsh undefine "$VM_NAME" 2>/dev/null || true
+    sudo virsh undefine "$VM_NAME" --nvram 2>/dev/null || true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sudo virsh destroy "$VM_NAME" 2>/dev/null || true
sudo virsh undefine "$VM_NAME" 2>/dev/null || true
sudo virsh destroy "$VM_NAME" 2>/dev/null || true
sudo virsh undefine "$VM_NAME" --nvram 2>/dev/null || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`
around lines 57 - 58, The cleanup in boot-nodes is leaving UEFI guest
definitions behind because the `virsh undefine` call in the VM teardown sequence
does not remove the OVMF NVRAM varstore. Update the `virsh undefine` usage in
the `boot-nodes` task so it includes `--nvram` for these guests, keeping the
existing `virsh destroy`/cleanup flow in place and ensuring the old domain is
fully removed before the next `virt-install` runs.

Comment on lines +114 to +124
- name: "[clean] Reload libvirt dnsmasq"
shell: |
sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true
changed_when: true
failed_when: false

- name: "[clean] Cleanup complete"
debug:
msg: >-
Removed VMs: {{ sno_new_nodes | map(attribute='name') | list | join(', ') }}.
Disk images, NVRAM, ISOs, DHCP reservations, and DNS entries removed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Cleanup leaves the NetworkManager dnsmasq override behind.

update-dns.yml creates {{ sno_nm_dnsmasq_conf }} and restarts NetworkManager, but this cleanup only HUPs libvirt dnsmasq. After the VMs and libvirt DNS records are gone, the host can still resolve api, api-int, and *.apps from stale local overrides. Remove that file and reload/restart NetworkManager here as part of the DNS cleanup path.

Suggested cleanup addition
+- name: "[clean] Remove NM dnsmasq override"
+  file:
+    path: "{{ sno_nm_dnsmasq_conf }}"
+    state: absent
+  become: true
+  failed_when: false
+
+- name: "[clean] Restart NetworkManager to drop local DNS overrides"
+  systemd:
+    name: NetworkManager
+    state: restarted
+  become: true
+  failed_when: false
+
 - name: "[clean] Reload libvirt dnsmasq"
   shell: |
     sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: "[clean] Reload libvirt dnsmasq"
shell: |
sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true
changed_when: true
failed_when: false
- name: "[clean] Cleanup complete"
debug:
msg: >-
Removed VMs: {{ sno_new_nodes | map(attribute='name') | list | join(', ') }}.
Disk images, NVRAM, ISOs, DHCP reservations, and DNS entries removed.
- name: "[clean] Remove NM dnsmasq override"
file:
path: "{{ sno_nm_dnsmasq_conf }}"
state: absent
become: true
failed_when: false
- name: "[clean] Restart NetworkManager to drop local DNS overrides"
systemd:
name: NetworkManager
state: restarted
become: true
failed_when: false
- name: "[clean] Reload libvirt dnsmasq"
shell: |
sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true
changed_when: true
failed_when: false
- name: "[clean] Cleanup complete"
debug:
msg: >-
Removed VMs: {{ sno_new_nodes | map(attribute='name') | list | join(', ') }}.
Disk images, NVRAM, ISOs, DHCP reservations, and DNS entries removed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`
around lines 114 - 124, The cleanup in the `clean.yml` task only reloads libvirt
dnsmasq, but it leaves the NetworkManager dnsmasq override created by
`update-dns.yml` in place. Update the `[clean]` flow to also remove `{{
sno_nm_dnsmasq_conf }}` and then reload or restart `NetworkManager` so stale
local DNS entries for `api`, `api-int`, and `*.apps` are cleared. Use the
existing cleanup block near `[clean] Reload libvirt dnsmasq` and `[clean]
Cleanup complete` to add this DNS teardown step.

Comment on lines +130 to +137
- name: "[create-vms] Add DNS entries to libvirt network"
shell: |
sudo virsh net-update {{ sno_libvirt_network }} add dns-host \
"<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \
--live --config
loop: "{{ sno_new_nodes }}"
failed_when: false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don't mask virsh net-update add dns-host failures.

This call is the durable source of truth for dnsmasq. If it fails here, the later lineinfile fallback only fixes the current addnhosts file, and a later net-update/network restart can drop the hostname while this phase still reports success.

Suggested fix
 - name: "[create-vms] Add DNS entries to libvirt network"
   shell: |
     sudo virsh net-update {{ sno_libvirt_network }} add dns-host \
       "<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \
       --live --config
   loop: "{{ sno_new_nodes }}"
-  failed_when: false
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: "[create-vms] Add DNS entries to libvirt network"
shell: |
sudo virsh net-update {{ sno_libvirt_network }} add dns-host \
"<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \
--live --config
loop: "{{ sno_new_nodes }}"
failed_when: false
- name: "[create-vms] Add DNS entries to libvirt network"
shell: |
sudo virsh net-update {{ sno_libvirt_network }} add dns-host \
"<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \
--live --config
loop: "{{ sno_new_nodes }}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`
around lines 130 - 137, The "[create-vms] Add DNS entries to libvirt network"
task currently hides failures by setting failed_when: false on the virsh
net-update add dns-host call. Remove the failure masking so the task in
create-vms.yml fails when virsh net-update cannot update dns-host, and keep the
existing loop over sno_new_nodes so each hostname update is validated. If a
fallback is needed, handle it explicitly in a separate step rather than allowing
this task to report success when the libvirt DNS update did not apply.

Comment thread deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml Outdated
Comment on lines +91 to +111
- name: "[verify] Assert 3 nodes"
assert:
that: (verify_nodes.stdout_lines | length) >= 3
fail_msg: "Expected 3+ nodes, found {{ verify_nodes.stdout_lines | length }}"

- name: "[verify] Assert HA topology"
assert:
that: verify_topo.stdout == "HighlyAvailable"
fail_msg: "Expected HighlyAvailable, got {{ verify_topo.stdout }}"

- name: "[verify] Assert 3 etcd members"
assert:
that: verify_etcd.stdout | trim | int >= 3
fail_msg: "Expected 3+ etcd pods, found {{ verify_etcd.stdout | trim }}"

- name: "[verify] SNO to 3-node transition verified successfully"
debug:
msg: >-
MIGRATION COMPLETE: {{ verify_nodes.stdout_lines | length }} nodes,
{{ verify_etcd.stdout | trim }} etcd members,
topology={{ verify_topo.stdout }}, CVO={{ verify_cvo.stdout }} replicas.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Final verification never proves the three nodes are actually Ready.

Line 93 only counts rows from oc get nodes --no-headers, so a cluster with NotReady control-plane nodes still reaches the success path at Lines 106-111. Assert readiness explicitly before declaring the migration complete.

Suggested fix
 - name: "[verify] Assert 3 nodes"
   assert:
-    that: (verify_nodes.stdout_lines | length) >= 3
-    fail_msg: "Expected 3+ nodes, found {{ verify_nodes.stdout_lines | length }}"
+    that:
+      - (verify_nodes.stdout_lines | length) >= 3
+      - (verify_nodes.stdout_lines | select('search', '\sReady(?:,|$)') | list | length) >= 3
+    fail_msg: "Expected 3 Ready nodes, got:\n{{ verify_nodes.stdout }}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: "[verify] Assert 3 nodes"
assert:
that: (verify_nodes.stdout_lines | length) >= 3
fail_msg: "Expected 3+ nodes, found {{ verify_nodes.stdout_lines | length }}"
- name: "[verify] Assert HA topology"
assert:
that: verify_topo.stdout == "HighlyAvailable"
fail_msg: "Expected HighlyAvailable, got {{ verify_topo.stdout }}"
- name: "[verify] Assert 3 etcd members"
assert:
that: verify_etcd.stdout | trim | int >= 3
fail_msg: "Expected 3+ etcd pods, found {{ verify_etcd.stdout | trim }}"
- name: "[verify] SNO to 3-node transition verified successfully"
debug:
msg: >-
MIGRATION COMPLETE: {{ verify_nodes.stdout_lines | length }} nodes,
{{ verify_etcd.stdout | trim }} etcd members,
topology={{ verify_topo.stdout }}, CVO={{ verify_cvo.stdout }} replicas.
- name: "[verify] Assert 3 nodes"
assert:
that:
- (verify_nodes.stdout_lines | length) >= 3
- (verify_nodes.stdout_lines | select('search', '\sReady(?:,|$)') | list | length) >= 3
fail_msg: "Expected 3 Ready nodes, got:\n{{ verify_nodes.stdout }}"
- name: "[verify] Assert HA topology"
assert:
that: verify_topo.stdout == "HighlyAvailable"
fail_msg: "Expected HighlyAvailable, got {{ verify_topo.stdout }}"
- name: "[verify] Assert 3 etcd members"
assert:
that: verify_etcd.stdout | trim | int >= 3
fail_msg: "Expected 3+ etcd pods, found {{ verify_etcd.stdout | trim }}"
- name: "[verify] SNO to 3-node transition verified successfully"
debug:
msg: >-
MIGRATION COMPLETE: {{ verify_nodes.stdout_lines | length }} nodes,
{{ verify_etcd.stdout | trim }} etcd members,
topology={{ verify_topo.stdout }}, CVO={{ verify_cvo.stdout }} replicas.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`
around lines 91 - 111, The final verification in the [verify] task only checks
the number of node rows and can pass even if some nodes are NotReady. Update the
existing node verification around verify_nodes so it explicitly asserts all
three nodes are Ready before the migration success message is emitted, and keep
the HA/etcd checks as-is. Use the verify_nodes assertion block and the "[verify]
SNO to 3-node transition verified successfully" debug step to locate where to
add the readiness condition.

Comment thread deploy/openshift-clusters/scripts/deploy-cluster.sh
@eggfoobar

Copy link
Copy Markdown
Contributor Author

/label tide/merge-method-squash

@openshift-ci openshift-ci Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 30, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deploy/openshift-clusters/scripts/deploy-cluster.sh (1)

52-63: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Keep --help aligned with the accepted topologies.

After allowing sno here, the file header and the -h|--help branch still advertise only arbiter|fencing. deploy-cluster.sh --help now contradicts the actual interface.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/deploy-cluster.sh` around lines 52 - 63,
Update the CLI documentation in deploy-cluster.sh so it matches the validation
logic that now accepts sno. In the script header and the -h|--help handling,
replace any arbiter|fencing-only wording with arbiter|fencing|sno, and ensure
the usage/examples for topology reference the same accepted values as the
TOPOLOGY validation block.
♻️ Duplicate comments (1)
deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml (1)

15-31: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Still fail the handler after the last oc project retry.

Line 19 keeps the task successful even when all retries are exhausted, and Lines 22-31 only log a warning. That still hides real kubeconfig/auth/API failures and lets the role continue in a bad state.

Proposed fix
 - name: Run oc project
   command: oc --kubeconfig="{{ kubeconfig_path }}" project openshift-machine-api
   when:
     - kubeconfig_path is defined
     - kubeconfig_stat.stat.exists | default(false)
   changed_when: false
   register: oc_project_result
   retries: 5
   delay: 15
   until: oc_project_result.rc == 0
-  failed_when: false
+  failed_when: oc_project_result.rc != 0
   listen: Set OCP project
-
-- name: Warn if oc project failed after all retries
-  debug:
-    msg: >-
-      WARNING: 'oc project openshift-machine-api' still failing after all retries
-      (rc={{ oc_project_result.rc | default('n/a') }}).
-      stderr: {{ oc_project_result.stderr | default('') | trim }}
-  when:
-    - kubeconfig_path is defined
-    - kubeconfig_stat.stat.exists | default(false)
-    - oc_project_result.rc | default(0) != 0
-  listen: Set OCP project
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml`
around lines 15 - 31, The handler currently masks a failed oc_project_result by
setting failed_when: false and only emitting a warning in the follow-up debug
task, so the role can continue after all retries are exhausted. Update the Set
OCP project handler logic to use oc_project_result in a failure condition after
the retries are done, and keep the warning only as additional context; ensure
the failure is tied to the existing oc_project_result check and the Set OCP
project handler rather than just logging in the debug task.
🧹 Nitpick comments (2)
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the repository-standard Bash shebang.

This new script uses #!/bin/bash, but the repo rule for *.sh is #!/usr/bin/bash.

As per coding guidelines, **/*.sh: Shell scripts should use #!/usr/bin/bash shebang.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` around lines 1 - 2, The
new shell script uses the wrong Bash shebang; update the top of sno-to-3node.sh
to match the repository-standard `#!/usr/bin/bash` used for `*.sh` scripts. Keep
the rest of the script unchanged and make sure the entrypoint shebang is
consistent with the repo’s shell script convention.

Source: Coding guidelines

deploy/openshift-clusters/scripts/clean-mutable-topology.sh (1)

1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the repository-standard Bash shebang.

This new script uses #!/bin/bash, but the repo rule for *.sh is #!/usr/bin/bash.

As per coding guidelines, **/*.sh: Shell scripts should use #!/usr/bin/bash shebang.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 -
2, The new shell script is using the wrong Bash interpreter path; update the
shebang in the script header to match the repository-standard format used for
.sh files. Locate the script by its top-level shebang and replace the current
Bash path with the repo-required /usr/bin/bash form so it aligns with the shell
script convention.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/Makefile`:
- Around line 79-84: The new deploy targets sno-ipi and sno-agent are missing
from the make help output, so update the help/documentation logic in
deploy/Makefile to advertise them alongside sno-to-3node. Add these targets to
the same help entry list or help-generation block used by make help so users can
discover the new entrypoints from the main CLI.

In `@deploy/openshift-clusters/clean-mutable-topology.yml`:
- Around line 35-53: The topology guard in the live cluster check is too
permissive because it allows cleanup to continue when the detected
controlPlaneTopology is Unknown. Update the logic around the live_topology check
and the Abort if cluster is HA and force_cleanup is not set step so destructive
cleanup only proceeds when the topology is explicitly SingleReplica, or when
force_cleanup is true; otherwise fail closed on Unknown or any non-explicit
result. Use the existing live_topology, clean.yml, and controlPlaneTopology
handling to make the condition explicit and prevent master-1/master-2 deletion
unless the topology is confirmed safe.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 127-130: The cleanup task still only reloads NetworkManager, so
the old dnsmasq state may survive and keep stale DNS overrides. Update the
cleanup step in the NetworkManager reset flow to use a full restart after
removing `{{ sno_nm_dnsmasq_conf }}`, matching the behavior used in
`update-dns.yml`; keep the change in the `clean.yml` task that manages
NetworkManager so the stale `api`/`api-int`/`*.apps` answers are fully cleared.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`:
- Around line 2-9: The stale VM cleanup in the create-vms task is missing the
OVMF-specific NVRAM removal flag, which can leave domain state behind and
interfere with recreating the VM. Update the `virsh undefine` call in the
`"[create-vms] Cleanup stale VMs"` task to include `--nvram`, keeping the rest
of the cleanup flow in `create-vms.yml` unchanged.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 155-167: The DNS verification in the preflight task is using
substring matching, which can incorrectly accept a different IP when one answer
prefixes another. Update the checks in the preflight shell task and the
`dns_apiint_verify` validation to compare against trimmed `stdout_lines` entries
instead of `in dns_apiint.stdout`, so the `dig` result for `api-int` must match
the exact `sno_master0_ip` value.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`:
- Around line 47-78: The verification step in the cluster operator summary only
reports degraded/unavailable operators and never blocks completion, so the
migration can still finish in a bad state. Update the verify task that runs the
oc get co summary and stores it in verify_co to explicitly fail when any
operator is Degraded or Available=False, using the existing degraded and
unavailable lists as the gate, so the role stops before emitting MIGRATION
COMPLETE.

---

Outside diff comments:
In `@deploy/openshift-clusters/scripts/deploy-cluster.sh`:
- Around line 52-63: Update the CLI documentation in deploy-cluster.sh so it
matches the validation logic that now accepts sno. In the script header and the
-h|--help handling, replace any arbiter|fencing-only wording with
arbiter|fencing|sno, and ensure the usage/examples for topology reference the
same accepted values as the TOPOLOGY validation block.

---

Duplicate comments:
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml`:
- Around line 15-31: The handler currently masks a failed oc_project_result by
setting failed_when: false and only emitting a warning in the follow-up debug
task, so the role can continue after all retries are exhausted. Update the Set
OCP project handler logic to use oc_project_result in a failure condition after
the retries are done, and keep the warning only as additional context; ensure
the failure is tied to the existing oc_project_result check and the Set OCP
project handler rather than just logging in the debug task.

---

Nitpick comments:
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-2: The new shell script is using the wrong Bash interpreter
path; update the shebang in the script header to match the repository-standard
format used for .sh files. Locate the script by its top-level shebang and
replace the current Bash path with the repo-required /usr/bin/bash form so it
aligns with the shell script convention.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Around line 1-2: The new shell script uses the wrong Bash shebang; update the
top of sno-to-3node.sh to match the repository-standard `#!/usr/bin/bash` used
for `*.sh` scripts. Keep the rest of the script unchanged and make sure the
entrypoint shebang is consistent with the repo’s shell script convention.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bd4f4653-4596-463f-be41-02bedea644dd

📥 Commits

Reviewing files that changed from the base of the PR and between 52f92a5 and e1bf815.

📒 Files selected for processing (27)
  • deploy/Makefile
  • deploy/openshift-clusters/clean-mutable-topology.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/files/.gitignore
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
  • deploy/openshift-clusters/scripts/clean-mutable-topology.sh
  • deploy/openshift-clusters/scripts/deploy-cluster.sh
  • deploy/openshift-clusters/scripts/sno-to-3node.sh
  • deploy/openshift-clusters/sno-to-3node.yml

Comment thread deploy/Makefile
Comment on lines +79 to +84
sno-ipi:
@./openshift-clusters/scripts/deploy-cluster.sh --topology sno --method ipi

sno-agent:
@./openshift-clusters/scripts/deploy-cluster.sh --topology sno --method agent

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Document the new SNO deploy targets in make help.

sno-ipi and sno-agent are added here, but the help output only advertises sno-to-3node. That makes the new entrypoints hard to discover from the main CLI.

Suggested help update
 	`@echo` "  fencing-ipi          - Deploy fencing IPI cluster (non-interactive)"
 	`@echo` "  fencing-agent        - Deploy fencing Agent cluster (non-interactive) (WIP Experimental)"
 	`@echo` "  arbiter-ipi          - Deploy arbiter IPI cluster (non-interactive)"
 	`@echo` "  arbiter-agent        - Deploy arbiter Agent cluster (non-interactive)"
+	`@echo` "  sno-ipi              - Deploy SNO IPI cluster (non-interactive)"
+	`@echo` "  sno-agent            - Deploy SNO Agent cluster (non-interactive)"
 	`@echo` "  arbiter-kcli         - Deploy arbiter cluster using kcli (non-interactive)"
 	`@echo` "  fencing-kcli         - Deploy fencing cluster using kcli (non-interactive)"
 	`@echo` "  fencing-assisted     - Deploy hub + spoke TNF cluster via assisted installer"
 	`@echo` "  sno-to-3node         - Transition existing SNO cluster to 3-node HA (platform:none)"

Also applies to: 135-145

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/Makefile` around lines 79 - 84, The new deploy targets sno-ipi and
sno-agent are missing from the make help output, so update the
help/documentation logic in deploy/Makefile to advertise them alongside
sno-to-3node. Add these targets to the same help entry list or help-generation
block used by make help so users can discover the new entrypoints from the main
CLI.

Comment on lines +35 to +53
- name: Check live cluster topology
shell: |
KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
oc get infrastructure cluster \
-o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown"
register: live_topology
changed_when: false
failed_when: false
ignore_errors: true

- name: Abort if cluster is HA and force_cleanup is not set
ansible.builtin.fail:
msg: >-
Cluster controlPlaneTopology is '{{ live_topology.stdout | trim }}'.
Refusing to remove master-1 and master-2 from a live HA cluster.
Re-run with -e "force_cleanup=true" to override.
when:
- live_topology.stdout | trim == "HighlyAvailable"
- not (force_cleanup | default(false) | bool)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Fail closed when topology detection is unavailable.

This still proceeds on Unknown. If oc/kubeconfig access is broken on an already-HA cluster, the playbook skips the guard and drops into clean.yml, which deletes master-1 and master-2 anyway. Destructive cleanup should only run when the topology is explicitly SingleReplica, or when force_cleanup=true is set.

Suggested guard change
-    - name: Abort if cluster is HA and force_cleanup is not set
+    - name: Abort unless cluster is SingleReplica or force_cleanup is set
       ansible.builtin.fail:
         msg: >-
           Cluster controlPlaneTopology is '{{ live_topology.stdout | trim }}'.
-          Refusing to remove master-1 and master-2 from a live HA cluster.
+          Refusing to remove master-1 and master-2 unless the live cluster is
+          explicitly SingleReplica.
           Re-run with -e "force_cleanup=true" to override.
       when:
-        - live_topology.stdout | trim == "HighlyAvailable"
         - not (force_cleanup | default(false) | bool)
+        - live_topology.stdout | trim != "SingleReplica"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Check live cluster topology
shell: |
KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
oc get infrastructure cluster \
-o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown"
register: live_topology
changed_when: false
failed_when: false
ignore_errors: true
- name: Abort if cluster is HA and force_cleanup is not set
ansible.builtin.fail:
msg: >-
Cluster controlPlaneTopology is '{{ live_topology.stdout | trim }}'.
Refusing to remove master-1 and master-2 from a live HA cluster.
Re-run with -e "force_cleanup=true" to override.
when:
- live_topology.stdout | trim == "HighlyAvailable"
- not (force_cleanup | default(false) | bool)
- name: Abort unless cluster is SingleReplica or force_cleanup is set
ansible.builtin.fail:
msg: >-
Cluster controlPlaneTopology is '{{ live_topology.stdout | trim }}'.
Refusing to remove master-1 and master-2 unless the live cluster is
explicitly SingleReplica.
Re-run with -e "force_cleanup=true" to override.
when:
- not (force_cleanup | default(false) | bool)
- live_topology.stdout | trim != "SingleReplica"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 35 - 53,
The topology guard in the live cluster check is too permissive because it allows
cleanup to continue when the detected controlPlaneTopology is Unknown. Update
the logic around the live_topology check and the Abort if cluster is HA and
force_cleanup is not set step so destructive cleanup only proceeds when the
topology is explicitly SingleReplica, or when force_cleanup is true; otherwise
fail closed on Unknown or any non-explicit result. Use the existing
live_topology, clean.yml, and controlPlaneTopology handling to make the
condition explicit and prevent master-1/master-2 deletion unless the topology is
confirmed safe.

Comment on lines +127 to +130
- name: "[clean] Reload NetworkManager to clear stale DNS overrides"
systemd:
name: NetworkManager
state: reloaded

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Restart NetworkManager here too.

update-dns.yml had to restart NetworkManager to apply {{ sno_nm_dnsmasq_conf }}, but cleanup only reloads after deleting that file. If the embedded dnsmasq process keeps the old config across reload, stale api/api-int/*.apps answers survive cleanup and poison the next run.

Proposed fix
 - name: "[clean] Reload NetworkManager to clear stale DNS overrides"
   systemd:
     name: NetworkManager
-    state: reloaded
+    state: restarted
   become: true
   failed_when: false
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: "[clean] Reload NetworkManager to clear stale DNS overrides"
systemd:
name: NetworkManager
state: reloaded
- name: "[clean] Reload NetworkManager to clear stale DNS overrides"
systemd:
name: NetworkManager
state: restarted
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`
around lines 127 - 130, The cleanup task still only reloads NetworkManager, so
the old dnsmasq state may survive and keep stale DNS overrides. Update the
cleanup step in the NetworkManager reset flow to use a full restart after
removing `{{ sno_nm_dnsmasq_conf }}`, matching the behavior used in
`update-dns.yml`; keep the change in the `clean.yml` task that manages
NetworkManager so the stale `api`/`api-int`/`*.apps` answers are fully cleared.

Comment on lines +2 to +9
- name: "[create-vms] Cleanup stale VMs"
shell: |
sudo virsh destroy {{ item.name }} 2>/dev/null || true
sudo virsh undefine {{ item.name }} --remove-all-storage 2>/dev/null || true
sudo rm -f {{ sno_vm_image_dir }}/{{ item.name }}.qcow2
loop: "{{ sno_new_nodes }}"
failed_when: false
changed_when: true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail

rg -n -C3 'virsh undefine|loader=.*/OVMF|--nvram' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node

Repository: openshift-eng/two-node-toolbox

Length of output: 4523


🌐 Web query:

libvirt virsh undefine --nvram OVMF UEFI domain behavior

💡 Result:

When using the virsh undefine command on a domain configured with UEFI (OVMF) firmware, the libvirt driver requires explicit instruction regarding the existing NVRAM variable store file [1][2]. If a domain has an associated NVRAM file and neither the --nvram nor the --keep-nvram flag is provided, the undefine operation will fail with an error such as "Requested operation is not valid: cannot undefine domain with nvram" [3][2][4]. The behavior of these flags is as follows: 1. --nvram: This flag instructs libvirt to delete the NVRAM variable store file (typically defined in the domain XML under /domain/os/nvram) upon undefining the domain [2][4]. This action is performed regardless of whether the domain is currently active or inactive, ensuring that the backing file is removed from the host's filesystem [5][2][6]. 2. --keep-nvram: This flag instructs libvirt to remove the domain's configuration from libvirt management while preserving the existing NVRAM file on the host's filesystem [2][4]. This mechanism was implemented to prevent the accidental loss of UEFI state (such as boot order, secure boot keys, or other NVRAM-stored variables) when a virtual machine is undefined, while also providing a safe mechanism for cleanup [1][2][7]. If you encounter difficulty deleting a UEFI-based VM, explicitly including --nvram in your command is the standard resolution [8][9].

Citations:


🏁 Script executed:

sed -n '50,75p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 1159


Add --nvram to the stale-VM undefine. virsh undefine needs it for OVMF guests, and without it the stale domain/NVRAM state can block the later VM create. deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml:5

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`
around lines 2 - 9, The stale VM cleanup in the create-vms task is missing the
OVMF-specific NVRAM removal flag, which can leave domain state behind and
interfere with recreating the VM. Update the `virsh undefine` call in the
`"[create-vms] Cleanup stale VMs"` task to include `--nvram`, keeping the rest
of the cleanup flow in `create-vms.yml` unchanged.

Comment on lines +155 to +167
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout
changed_when: true

- name: "[preflight] Verify api-int DNS after fix"
shell: |
dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1
register: dns_apiint_verify
changed_when: false
failed_when: sno_master0_ip not in dns_apiint_verify.stdout
retries: 3
delay: 2
until: sno_master0_ip in dns_apiint_verify.stdout
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Match DNS answers exactly, not by substring.

in dns_apiint.stdout can accept the wrong IP when one address is a prefix of another. Compare against trimmed stdout_lines instead.

Proposed fix
-  when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout
+  when: dns_apiint.stdout | trim == "" or sno_master0_ip not in (dns_apiint.stdout_lines | map('trim') | list)
@@
-  failed_when: sno_master0_ip not in dns_apiint_verify.stdout
+  failed_when: sno_master0_ip not in (dns_apiint_verify.stdout_lines | map('trim') | list)
@@
-  until: sno_master0_ip in dns_apiint_verify.stdout
+  until: sno_master0_ip in (dns_apiint_verify.stdout_lines | map('trim') | list)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout
changed_when: true
- name: "[preflight] Verify api-int DNS after fix"
shell: |
dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1
register: dns_apiint_verify
changed_when: false
failed_when: sno_master0_ip not in dns_apiint_verify.stdout
retries: 3
delay: 2
until: sno_master0_ip in dns_apiint_verify.stdout
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in (dns_apiint.stdout_lines | map('trim') | list)
changed_when: true
- name: "[preflight] Verify api-int DNS after fix"
shell: |
dig +short api-int.{{ sno_cluster_domain }} `@192.168.111.1`
register: dns_apiint_verify
changed_when: false
failed_when: sno_master0_ip not in (dns_apiint_verify.stdout_lines | map('trim') | list)
retries: 3
delay: 2
until: sno_master0_ip in (dns_apiint_verify.stdout_lines | map('trim') | list)
when: dns_apiint.stdout | trim == "" or sno_master0_ip not in (dns_apiint.stdout_lines | map('trim') | list)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`
around lines 155 - 167, The DNS verification in the preflight task is using
substring matching, which can incorrectly accept a different IP when one answer
prefixes another. Update the checks in the preflight shell task and the
`dns_apiint_verify` validation to compare against trimmed `stdout_lines` entries
instead of `in dns_apiint.stdout`, so the `dig` result for `api-int` must match
the exact `sno_master0_ip` value.

Comment on lines +47 to +78
- name: "[verify] Get cluster operator summary"
shell: |
oc get co -o json | python3 -c "
import json, sys
data = json.load(sys.stdin)
total = len(data['items'])
degraded = []
unavailable = []
for co in data['items']:
name = co['metadata']['name']
conds = {c['type']: c['status'] for c in co.get('status', {}).get('conditions', [])}
if conds.get('Degraded') == 'True':
degraded.append(name)
if conds.get('Available') == 'False':
unavailable.append(name)
print(f'Total COs: {total}')
if degraded:
print(f'Degraded: {chr(44).join(degraded)}')
else:
print('Degraded: none')
if unavailable:
print(f'Unavailable: {chr(44).join(unavailable)}')
else:
print('Unavailable: none')
"
environment: "{{ sno_oc_env }}"
register: verify_co
changed_when: false

- name: "[verify] Cluster operators"
debug:
msg: "{{ verify_co.stdout }}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Fail the migration when cluster operators are still degraded or unavailable.

This block only prints the degraded/unavailable operator list, then the role still emits MIGRATION COMPLETE. Since mco-rollout.yml is best-effort and main.yml ends after this phase, there is no later hard gate that proves post-transition convergence.

Proposed fix
 - name: "[verify] Cluster operators"
   debug:
     msg: "{{ verify_co.stdout }}"
 
+- name: "[verify] Assert cluster operators healthy"
+  assert:
+    that:
+      - "'Degraded: none' in verify_co.stdout"
+      - "'Unavailable: none' in verify_co.stdout"
+    fail_msg: "{{ verify_co.stdout }}"
+
 - name: "[verify] Get MCP master status"
   shell: |
     oc get mcp master --no-headers

Also applies to: 111-116

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`
around lines 47 - 78, The verification step in the cluster operator summary only
reports degraded/unavailable operators and never blocks completion, so the
migration can still finish in a bad state. Update the verify task that runs the
oc get co summary and stores it in verify_co to explicitly fail when any
operator is Degraded or Available=False, using the existing degraded and
unavailable lists as the gate, so the role stops before emitting MIGRATION
COMPLETE.

eggfoobar and others added 2 commits July 1, 2026 10:43
Add a new Ansible role and make target to perform a one-shot migration
of a platform:None SNO cluster to a 3-node HA cluster. The role creates
two new control-plane VMs, boots them with RHCOS via coreos-installer
ISO ignition embed, waits for nodes to join and etcd to scale, then
transitions the topology to HighlyAvailable.

Usage: make sno-to-3node

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar eggfoobar force-pushed the add-migration-support branch from e1bf815 to f8eef94 Compare July 1, 2026 15:15

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml (1)

127-132: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Still reloading instead of restarting NetworkManager.

update-dns.yml restarts NetworkManager to apply sno_nm_dnsmasq_conf, but this cleanup step only reloads (state: reloaded) after deleting the file, which was previously flagged as insufficient to guarantee the embedded dnsmasq process drops the stale config.

♻️ Proposed fix
 - name: "[clean] Reload NetworkManager to clear stale DNS overrides"
   systemd:
     name: NetworkManager
-    state: reloaded
+    state: restarted
   become: true
   failed_when: false
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`
around lines 127 - 132, The cleanup task still uses a NetworkManager reload,
which is not enough to ensure the embedded dnsmasq process drops the removed DNS
override. Update the task named “[clean] Reload NetworkManager to clear stale
DNS overrides” in clean.yml to perform a restart via the systemd task instead of
a reload, matching the behavior used in update-dns.yml and keeping the existing
become and failed_when handling.
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml (1)

47-90: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Cluster operator and MCP health are still not gated before declaring success.

The [verify] Cluster operators and [verify] MCP status blocks only print verify_co.stdout / verify_mcp.stdout; there is no assert that fails the play when operators are Degraded/Unavailable or the MCP is not Updated. main.yml treats verify.yml as the terminal "Phase 9" step, so the role can still report a successful migration with degraded operators or a stuck MCP rollout.

♻️ Proposed fix
 - name: "[verify] Cluster operators"
   debug:
     msg: "{{ verify_co.stdout }}"
 
+- name: "[verify] Assert cluster operators healthy"
+  assert:
+    that:
+      - "'Degraded: none' in verify_co.stdout"
+      - "'Unavailable: none' in verify_co.stdout"
+    fail_msg: "{{ verify_co.stdout }}"
+
 - name: "[verify] Get MCP master status"
   shell: |
     oc get mcp master --no-headers
   environment: "{{ sno_oc_env }}"
   register: verify_mcp
   changed_when: false
 
 - name: "[verify] MCP status"
   debug:
     msg: "{{ verify_mcp.stdout }}"
+
+- name: "[verify] Assert MCP updated"
+  assert:
+    that: "'True' in verify_mcp.stdout"
+    fail_msg: "MCP master not fully updated:\n{{ verify_mcp.stdout }}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`
around lines 47 - 90, The verify tasks currently only report Cluster Operator
and MCP output, but do not fail the play when health is bad. Update the verify
flow in the existing `verify_co`/`verify_mcp` checks so it asserts no operators
are Degraded or Unavailable and that the master MCP is fully updated/ready
before success is declared. Use the existing `verify_co` and `verify_mcp`
registered results to add explicit failure conditions after the `"[verify]
Cluster operators"` and `"[verify] MCP status"` steps.
🧹 Nitpick comments (3)
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)

1-1: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Shebang should be #!/usr/bin/bash.

As per coding guidelines, **/*.sh files "should use #!/usr/bin/bash shebang". This script uses #!/bin/bash.

diff
-#!/bin/bash
+#!/usr/bin/bash
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` at line 1, Update the
script header to use the required bash interpreter path: this file currently
starts with the wrong shebang, so replace the existing `#!/bin/bash` at the top
of `sno-to-3node.sh` with `#!/usr/bin/bash` to match the shell script guideline.

Source: Coding guidelines

deploy/openshift-clusters/scripts/clean-mutable-topology.sh (2)

1-22: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate logic with sno-to-3node.sh.

This script is identical to deploy/openshift-clusters/scripts/sno-to-3node.sh aside from the playbook filename. Consider extracting the shared instance/inventory validation and invocation logic into a common helper (e.g., alongside common.sh) to avoid divergence over time.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 -
22, The validation and ansible invocation logic in clean-mutable-topology.sh is
duplicated in sno-to-3node.sh, so extract the shared checks and playbook runner
into a common helper sourced by both scripts. Move the instance-dir and
inventory.ini validation plus the cd/ansible-playbook flow into a reusable
function in common.sh or a nearby shared script, and keep only the
playbook-specific filename in each wrapper. Use the existing get_node_dir and
DEPLOY_DIR setup to preserve behavior while removing the duplicate shell logic.

1-1: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Shebang should be #!/usr/bin/bash.

As per coding guidelines, **/*.sh files "should use #!/usr/bin/bash shebang". This script uses #!/bin/bash.

diff
-#!/bin/bash
+#!/usr/bin/bash
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` at line 1, The
script uses the wrong shebang and should follow the shell script guideline by
switching the top-of-file interpreter line in clean-mutable-topology.sh from the
current bash path to usr/bin/bash. Update the shebang only, keeping the rest of
the script unchanged.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/openshift-clusters/clean-mutable-topology.yml`:
- Around line 16-39: The cleanup probes are rebuilding KUBECONFIG from
dev_scripts_path and sno_cluster_name instead of honoring the sno_kubeconfig
override, which can point them at the wrong cluster. Update the two shell probes
in the cleanup workflow to use sno_kubeconfig when it is defined, and only fall
back to the existing computed kubeconfig path when the override is absent. Keep
the existing detection logic in the domain and topology steps, but make sure
both oc invocations reference the same kubeconfig source consistently.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 122-124: The preflight DNS verification is hardcoded to query a
fixed dnsmasq server, which can point at the wrong libvirt network. Update the
`dig` checks in the preflight tasks to use the DNS server associated with `{{
sno_libvirt_network }}` instead of `@192.168.111.1`, and make the same change in
both the `api-int` and related verification steps so the check matches the
repair path.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2`:
- Around line 8-12: The `/etc/master.ign` entry in `auto-install.ign.j2` is too
permissive for rendered ignition content that may include sensitive
control-plane credentials. Update the file mode for this object to restrict
access to the installer service only, keeping the change localized to the
`/etc/master.ign` stanza and preserving the existing `contents.source` handling.

---

Duplicate comments:
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 127-132: The cleanup task still uses a NetworkManager reload,
which is not enough to ensure the embedded dnsmasq process drops the removed DNS
override. Update the task named “[clean] Reload NetworkManager to clear stale
DNS overrides” in clean.yml to perform a restart via the systemd task instead of
a reload, matching the behavior used in update-dns.yml and keeping the existing
become and failed_when handling.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`:
- Around line 47-90: The verify tasks currently only report Cluster Operator and
MCP output, but do not fail the play when health is bad. Update the verify flow
in the existing `verify_co`/`verify_mcp` checks so it asserts no operators are
Degraded or Unavailable and that the master MCP is fully updated/ready before
success is declared. Use the existing `verify_co` and `verify_mcp` registered
results to add explicit failure conditions after the `"[verify] Cluster
operators"` and `"[verify] MCP status"` steps.

---

Nitpick comments:
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-22: The validation and ansible invocation logic in
clean-mutable-topology.sh is duplicated in sno-to-3node.sh, so extract the
shared checks and playbook runner into a common helper sourced by both scripts.
Move the instance-dir and inventory.ini validation plus the cd/ansible-playbook
flow into a reusable function in common.sh or a nearby shared script, and keep
only the playbook-specific filename in each wrapper. Use the existing
get_node_dir and DEPLOY_DIR setup to preserve behavior while removing the
duplicate shell logic.
- Line 1: The script uses the wrong shebang and should follow the shell script
guideline by switching the top-of-file interpreter line in
clean-mutable-topology.sh from the current bash path to usr/bin/bash. Update the
shebang only, keeping the rest of the script unchanged.

In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Line 1: Update the script header to use the required bash interpreter path:
this file currently starts with the wrong shebang, so replace the existing
`#!/bin/bash` at the top of `sno-to-3node.sh` with `#!/usr/bin/bash` to match
the shell script guideline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5541d134-f639-4040-b324-49e50d05bf1c

📥 Commits

Reviewing files that changed from the base of the PR and between d3c5ba2 and f8eef94.

📒 Files selected for processing (23)
  • deploy/Makefile
  • deploy/openshift-clusters/clean-mutable-topology.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml
  • deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml
  • deploy/openshift-clusters/scripts/clean-mutable-topology.sh
  • deploy/openshift-clusters/scripts/sno-to-3node.sh
  • deploy/openshift-clusters/sno-to-3node.yml

Comment on lines +16 to +39
- name: Detect cluster domain from kubeconfig (best-effort)
shell: |
KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \
| sed 's|^https://api-int\.||; s|:6443$||'
register: detected_domain
changed_when: false
failed_when: false
ignore_errors: true

- name: Set cluster domain (use detected, variable override, or default)
set_fact:
sno_cluster_domain: >-
{{ sno_cluster_domain
if (sno_cluster_domain is defined and sno_cluster_domain)
else (detected_domain.stdout | trim
if (detected_domain.stdout is defined and detected_domain.stdout | trim)
else 'ostest.test.metalkube.org') }}

- name: Check live cluster topology
shell: |
KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
oc get infrastructure cluster \
-o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Honor sno_kubeconfig in the cleanup probes.

Both oc probes rebuild KUBECONFIG from dev_scripts_path/sno_cluster_name, but this role now exposes sno_kubeconfig as the override for non-standard installs. When that override is used, domain detection falls back and topology detection degrades to Unknown, so this cleanup path can make decisions against the wrong cluster state.

Suggested fix
+    - name: Resolve kubeconfig path for mutable-topology checks
+      ansible.builtin.set_fact:
+        effective_sno_kubeconfig: >-
+          {{ sno_kubeconfig
+             if (sno_kubeconfig is defined and sno_kubeconfig)
+             else (dev_scripts_path | default('openshift-metal3/dev-scripts')) ~ '/ocp/' ~ (sno_cluster_name | default('ostest')) ~ '/auth/kubeconfig' }}
+
     - name: Detect cluster domain from kubeconfig (best-effort)
       shell: |
-        KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
+        KUBECONFIG={{ effective_sno_kubeconfig | quote }} \
           oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \
           | sed 's|^https://api-int\.||; s|:6443$||'
@@
     - name: Check live cluster topology
       shell: |
-        KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
+        KUBECONFIG={{ effective_sno_kubeconfig | quote }} \
           oc get infrastructure cluster \
           -o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Detect cluster domain from kubeconfig (best-effort)
shell: |
KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \
| sed 's|^https://api-int\.||; s|:6443$||'
register: detected_domain
changed_when: false
failed_when: false
ignore_errors: true
- name: Set cluster domain (use detected, variable override, or default)
set_fact:
sno_cluster_domain: >-
{{ sno_cluster_domain
if (sno_cluster_domain is defined and sno_cluster_domain)
else (detected_domain.stdout | trim
if (detected_domain.stdout is defined and detected_domain.stdout | trim)
else 'ostest.test.metalkube.org') }}
- name: Check live cluster topology
shell: |
KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
oc get infrastructure cluster \
-o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown"
- name: Resolve kubeconfig path for mutable-topology checks
ansible.builtin.set_fact:
effective_sno_kubeconfig: >-
{{ sno_kubeconfig
if (sno_kubeconfig is defined and sno_kubeconfig)
else (dev_scripts_path | default('openshift-metal3/dev-scripts')) ~ '/ocp/' ~ (sno_cluster_name | default('ostest')) ~ '/auth/kubeconfig' }}
- name: Detect cluster domain from kubeconfig (best-effort)
shell: |
KUBECONFIG={{ effective_sno_kubeconfig | quote }} \
oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \
| sed 's|^https://api-int\.||; s|:6443$||'
register: detected_domain
changed_when: false
failed_when: false
ignore_errors: true
- name: Set cluster domain (use detected, variable override, or default)
set_fact:
sno_cluster_domain: >-
{{ sno_cluster_domain
if (sno_cluster_domain is defined and sno_cluster_domain)
else (detected_domain.stdout | trim
if (detected_domain.stdout is defined and detected_domain.stdout | trim)
else 'ostest.test.metalkube.org') }}
- name: Check live cluster topology
shell: |
KUBECONFIG={{ effective_sno_kubeconfig | quote }} \
oc get infrastructure cluster \
-o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 16 - 39,
The cleanup probes are rebuilding KUBECONFIG from dev_scripts_path and
sno_cluster_name instead of honoring the sno_kubeconfig override, which can
point them at the wrong cluster. Update the two shell probes in the cleanup
workflow to use sno_kubeconfig when it is defined, and only fall back to the
existing computed kubeconfig path when the override is absent. Keep the existing
detection logic in the domain and topology steps, but make sure both oc
invocations reference the same kubeconfig source consistently.

Comment on lines +122 to +124
- name: "[preflight] Verify api-int DNS on hypervisor"
shell: |
dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Stop querying a fixed DNS server here.

Both dig checks hardcode @192.168.111.1, but the repair path updates {{ sno_libvirt_network }}. On any non-default libvirt network, preflight can inspect the wrong dnsmasq instance and either repair unnecessarily or fail verification after a successful virsh net-update.

Also applies to: 158-160

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`
around lines 122 - 124, The preflight DNS verification is hardcoded to query a
fixed dnsmasq server, which can point at the wrong libvirt network. Update the
`dig` checks in the preflight tasks to use the DNS server associated with `{{
sno_libvirt_network }}` instead of `@192.168.111.1`, and make the same change in
both the `api-int` and related verification steps so the check matches the
repair path.

Comment on lines +8 to +12
"path": "/etc/master.ign",
"mode": 420,
"contents": {
"source": "data:text/plain;charset=utf-8;base64,{{ sno_master_ign_b64 }}"
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

Tighten /etc/master.ign permissions.

master.ign can carry the rendered control-plane ignition, including pull-secret and cluster credential material. 0644 makes it world-readable in the live ISO environment for no benefit; 0600 is enough for the installer service.

Suggested fix
-        "mode": 420,
+        "mode": 384,

As per coding guidelines, "Handle secrets and credentials securely."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"path": "/etc/master.ign",
"mode": 420,
"contents": {
"source": "data:text/plain;charset=utf-8;base64,{{ sno_master_ign_b64 }}"
}
"path": "/etc/master.ign",
"mode": 384,
"contents": {
"source": "data:text/plain;charset=utf-8;base64,{{ sno_master_ign_b64 }}"
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2`
around lines 8 - 12, The `/etc/master.ign` entry in `auto-install.ign.j2` is too
permissive for rendered ignition content that may include sensitive
control-plane credentials. Update the file mode for this object to restrict
access to the installer service only, keeping the change localized to the
`/etc/master.ign` stanza and preserving the existing `contents.source` handling.

Source: Coding guidelines

@eggfoobar eggfoobar force-pushed the add-migration-support branch from f8eef94 to c5d09b6 Compare July 1, 2026 16:33
pull rhcos-live.iso from payload if not present
install coreos-installer if not present

Signed-off-by: ehila <ehila@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml (1)

80-81: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Keep virsh undefine aligned with UEFI/NVRAM guests.

This is still missing --nvram; with the OVMF varstore configured by virt-install, undefine can fail and leave the old domain behind before the next create.

Suggested fix
     sudo virsh destroy "$VM_NAME" 2>/dev/null || true
-    sudo virsh undefine "$VM_NAME" 2>/dev/null || true
+    sudo virsh undefine "$VM_NAME" --nvram 2>/dev/null || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`
around lines 80 - 81, Update the cleanup in the boot-nodes task so `virsh
undefine` handles UEFI/NVRAM guests created by `virt-install`; specifically, in
the `boot-nodes.yml` teardown block where `sudo virsh destroy` and `sudo virsh
undefine` run, add the NVRAM-aware option to the `virsh undefine` call so the
old domain is fully removed before recreation.
🧹 Nitpick comments (1)
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml (1)

60-68: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Preserve idempotent task status for the installer check.

This task reports changed even when coreos-installer is already installed. Add changed_when so reruns don’t show a false mutation.

Suggested fix
 - name: "[boot] Ensure coreos-installer is available"
   shell: |
     if ! command -v coreos-installer &>/dev/null; then
       echo "coreos-installer not found, installing via dnf..."
       sudo dnf install -y coreos-installer
     else
       echo "coreos-installer already installed: $(command -v coreos-installer)"
     fi
+  register: coreos_installer_check
+  changed_when: "'installing via dnf' in coreos_installer_check.stdout"

As per coding guidelines, deployments should be idempotent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`
around lines 60 - 68, The boot-nodes.yml task that checks for coreos-installer
is reporting changed even on no-op reruns. Update the shell task named "[boot]
Ensure coreos-installer is available" to set changed_when so it only reports
changed when the dnf install branch runs, and stays unchanged when command -v
coreos-installer already succeeds. Make sure the task remains idempotent while
preserving the current install behavior.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`:
- Around line 80-81: Update the cleanup in the boot-nodes task so `virsh
undefine` handles UEFI/NVRAM guests created by `virt-install`; specifically, in
the `boot-nodes.yml` teardown block where `sudo virsh destroy` and `sudo virsh
undefine` run, add the NVRAM-aware option to the `virsh undefine` call so the
old domain is fully removed before recreation.

---

Nitpick comments:
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`:
- Around line 60-68: The boot-nodes.yml task that checks for coreos-installer is
reporting changed even on no-op reruns. Update the shell task named "[boot]
Ensure coreos-installer is available" to set changed_when so it only reports
changed when the dnf install branch runs, and stays unchanged when command -v
coreos-installer already succeeds. Make sure the task remains idempotent while
preserving the current install behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2c26fbf7-9252-4826-94ce-3ec2d575b695

📥 Commits

Reviewing files that changed from the base of the PR and between f8eef94 and a8d15d1.

📒 Files selected for processing (1)
  • deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants