OCPEDGE-2498: Add migration support#83
Conversation
|
@eggfoobar: This pull request references OCPEDGE-2498 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "5.0.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: eggfoobar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThis PR adds a complete Ansible-based automation workflow to transition a single-node OpenShift (SNO) cluster to a 3-node control plane, plus reverse cleanup. It includes new Makefile targets, shell entrypoint scripts, top-level playbooks, and a ChangesSNO to 3-node transition automation
Estimated code review effort: 4 (Complex) | ~75 minutes Possibly related PRs
Suggested reviewers: Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (10 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
deploy/Makefile (1)
122-130: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winDocument the new
sno-ipiandsno-agenttargets inmake help.Lines 79-83 add both commands, but this help block only advertises
sno-to-3node.make helpwill miss the new entrypoints entirely.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/Makefile` around lines 122 - 130, Update the Makefile help output so the new `sno-ipi` and `sno-agent` targets are listed alongside the existing deployment entries in `make help`. Add matching `@echo` lines in the help block near the `sno-to-3node` entry, using the same wording/style as the other target descriptions so the new entrypoints are discoverable.deploy/openshift-clusters/scripts/deploy-cluster.sh (1)
62-69: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winReject the unsupported
sno + kclicombination explicitly.This validation now allows
--topology snofor every method, so a directdeploy-cluster.sh --topology sno --method kclifalls through tokcli-install.yml. The PR only adds SNO support through the install-dev/setup flow, so this opens an unsupported path instead of failing fast.Suggested fix
if [[ "${METHOD}" != "ipi" && "${METHOD}" != "agent" && "${METHOD}" != "kcli" ]]; then echo "Error: Invalid method '${METHOD}'. Must be 'ipi', 'agent', or 'kcli'." exit 1 fi + +if [[ "${TOPOLOGY}" == "sno" && "${METHOD}" == "kcli" ]]; then + echo "Error: Topology 'sno' only supports 'ipi' or 'agent'." + exit 1 +fiAlso applies to: 87-104
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/deploy-cluster.sh` around lines 62 - 69, The validation in deploy-cluster.sh currently permits every METHOD with TOPOLOGY=sno, which allows an unsupported sno + kcli path to continue into the kcli install flow. Update the argument checks around the TOPOLOGY/METHOD validation blocks to explicitly reject the unsupported combination before any install logic runs, and keep the failure consistent with the existing validation messages. Reference the validation section and the kcli-install path so the guard is placed where the topology and method are first checked.deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml (1)
26-32: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winInclude the SNO scenario in the missing-variable error.
Line 31 still only tells agent users to add
TNF_IPV4orTNA_IPV4. Now thatsnois a supported prefix on Line 43, a missingAGENT_E2E_TEST_SCENARIOfor SNO will produce the wrong remediation.Proposed fix
- AGENT_E2E_TEST_SCENARIO="TNF_IPV4" (fencing) or "TNA_IPV4" (arbiter) to - the config file. + AGENT_E2E_TEST_SCENARIO="TNF_IPV4" (fencing), "TNA_IPV4" (arbiter), or + "SNO_IPV4" (sno) to the config file.Also applies to: 43-43
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml` around lines 26 - 32, The missing-AGENT_E2E_TEST_SCENARIO failure message in the config validation task should be updated to include the SNO option now supported by the method-prefix logic. Adjust the fail message in the config validation flow so it lists the correct remediation for all supported prefixes, including the new sno case alongside the existing agent scenarios, and make sure the guidance matches the prefix handling used in the task that selects by method.
🧹 Nitpick comments (2)
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)
1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the repository-standard Bash shebang.
This new wrapper still uses
#!/bin/bash, but the repo shell-script rule requires#!/usr/bin/bash. As per coding guidelines, "Shell scripts should use#!/usr/bin/bashshebang".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` around lines 1 - 2, The wrapper script currently uses the non-standard Bash shebang, so update the script header to match the repository convention by changing the shebang at the top of sno-to-3node.sh to the repo-standard Bash path. Keep the rest of the script unchanged, and verify the file still starts with the expected shebang followed by set -euo pipefail.Source: Coding guidelines
deploy/openshift-clusters/scripts/clean-mutable-topology.sh (1)
1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the repository-standard Bash shebang.
This new wrapper still uses
#!/bin/bash, but the repo shell-script rule requires#!/usr/bin/bash. As per coding guidelines, "Shell scripts should use#!/usr/bin/bashshebang".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 - 2, The clean-mutable-topology.sh wrapper uses the wrong Bash shebang, so update the script header to match the repository standard. Replace the current shebang at the top of the file with the repo-required Bash interpreter line, keeping the rest of the script unchanged; this is the only fix needed in the script’s startup header.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deploy/openshift-clusters/clean-mutable-topology.yml`:
- Around line 35-39: The cleanup playbook currently imports
mutable-topology/sno-to-3node tasks/clean.yml unconditionally, which can destroy
a healthy HA control plane. Add a topology check in clean-mutable-topology.yml
before the import_role so it only runs when the cluster is not already HA, or
require an explicit force_cleanup=true override to proceed. Use the existing
import_role task as the entry point and gate it with a condition or precheck
that prevents destructive cleanup unless the override is set.
In
`@deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh`:
- Around line 35-40: This example hardcodes an x86_64 release image and leaves
out the ARM64-specific Metal3 overrides, so update the config example to
explicitly document how to switch to aarch64/Graviton. Use
OPENSHIFT_RELEASE_IMAGE as the payload architecture selector and add the
required Metal3 container override guidance (IRONIC_IMAGE, VBMC_IMAGE,
SUSHY_TOOLS_IMAGE) in the example so the install path is correct for ARM64
hypervisors. Refer to the config_sno_example.sh example block and keep the
existing installer/image-policy comments consistent with the new ARM64 notes.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 68-72: The cleanup task in clean.yml is deleting the original SNO
node’s api-int DNS entry, which can break the surviving cluster after cleanup.
Update the [clean] Remove api-int from libvirt network DNS task to target only
mutable-topology-added resources associated with sno_new_nodes, and avoid
removing the api-int record for sno_master0_ip in the libvirt network update
logic.
- Around line 37-43: The embedded Python in the cleanup task scripts is
indented, which causes the inline `python3 -c` snippets used in `clean.yml` to
fail before the DHCP cleanup logic runs. Update the scripts in the task blocks
that use `EXISTING_MAC` and the lease cleanup command so the Python content is
left-aligned or otherwise dedented, and keep the logic under the same task names
so `|| true` / `failed_when: false` no longer hides a parse failure.
- Around line 26-30: The per-node ISO cleanup in the [clean] task is using a
hostname-based filename that can collide across clusters on the same hypervisor.
Update the ISO naming scheme in the related ISO creation and cleanup flow to use
a cluster-scoped identifier, such as item.name or a sno_cluster_name-prefixed
name, and make the file path in the cleanup task match that same unique pattern.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`:
- Around line 27-32: The DNS cleanup in the `create-vms.yml` task is still
targeting `{{ item.ip }}`, so stale `dns-host` records for `master-1`/`master-2`
with old IPs are not removed before `virsh net-update add dns-host` runs. Update
the cleanup logic around the `virsh net-update` calls to delete entries by
hostname instead of by IP, and then tighten the `dig` verification so it checks
that the resolved answers actually contain `{{ item.ip }}` rather than only a
non-empty response.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml`:
- Around line 19-24: The MCS reachability check in the ignition task is
currently allowed to fail and continue, which defers a required configuration
error until boot. Update the existing "[ignition] Verify MCS reachability" step
so it aborts when the curl probe to the MCS health endpoint fails, and only keep
a non-failing path if you explicitly add a break-glass override. Use the
existing mcs_health probe logic as the place to enforce fail-fast behavior.
- Around line 13-18: The ignition generation step writes /tmp/master.ign but
never populates sno_master_ign_b64, so auto-install.ign.j2 cannot render
reliably. Update the relevant task flow in ignition.yml to read back the
generated master.ign and set sno_master_ign_b64 before the template that
consumes it, using the existing ignition generation and auto-install rendering
steps as the anchor points. Ensure the variable is produced within the role
rather than assuming an external caller supplies it.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml`:
- Around line 68-81: The MCO rollout polling in the mco-rollout task currently
suppresses timeout failures by using failed_when: false, which allows the play
to continue even if mcp_poll never reports completion. Update the
polling/verification logic in the mcp_poll task so a timeout is fatal by
default, or make the non-fatal continuation explicitly opt-in via a best-effort
mode. Keep the success check based on mcp_poll.stdout and ensure the final MCO
rollout result step reflects failure when the rollout does not complete.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 80-120: The preflight checks in the etcd probe and the
cluster-operator health check are only logging warnings instead of stopping the
topology transition, which makes this phase non-gating. Update the preflight
logic so the tasks for the etcd pod check and the degraded CO check fail the
play by default, removing the unconditional `|| true` and the `failed_when:
false` behavior; keep any bypass only behind an explicit break-glass override.
Use the existing task names like "[preflight] Check etcd pods" and "[preflight]
Check for degraded cluster operators" to locate the affected steps.
- Around line 129-154: The preflight repair flow for api-int only deletes
entries that already match sno_master0_ip and the verification step only checks
for any DNS answer, so stale mappings can remain undetected. Update the shell
logic in the api-int repair task to remove existing api-int entries by
hostname/FQDN regardless of IP, then make the dns_apiint_verify check confirm
the resolved result includes sno_master0_ip rather than just being non-empty.
Use the existing preflight tasks and variables like sno_libvirt_network,
sno_master0_ip, and sno_cluster_domain to locate and adjust the repair and
verification steps.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml`:
- Around line 24-31: The dnsmasq config in the update-dns task is incorrectly
pinning the wildcard apps record to sno_all_node_ips[0], which can vary because
node ordering is not stable. Update the logic in the sno_dnsmasq_lines
generation to use a stable ingress target such as a VIP/load balancer, or
otherwise emit all intended ingress node IPs instead of selecting the first node
IP. Make the change in the set_fact block that builds the DNS entries so the
*.apps record does not depend on array order.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.yml`:
- Around line 28-48: The wait-etcd task only checks for 3 Ready pods, while the
etcd membership and endpoint health probes in the same flow still suppress
failures via failed_when: false and fallback echo text. Update the wait logic in
wait-etcd.yml so the shell checks using oc exec and etcdctl member list /
endpoint health are allowed to fail the task, or fold their success into the
retry condition, so the role does not proceed to topology.yml until etcd quorum
and health are actually good.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.yml`:
- Around line 4-21: The CSR approval logic in the wait-nodes task is too broad
because it approves every pending request returned by oc get csr. Update the
pending-CSR collection and approval loop in the shell block so it only selects
kubelet client/server CSRs associated with sno_new_nodes, and then approve only
those entries before calling oc adm certificate approve. Use the existing
PENDING/CSR handling in wait-nodes.yml as the place to narrow the filter.
In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.yml`:
- Around line 2-5: The kubeconfig path in sno_kubeconfig_resolved is incorrectly
derived from the default sno_cluster_name, so it can point to the wrong cluster
before discovery happens. Update the mutable-topology SNO-to-3node vars so
sno_cluster_name is not assumed here; instead require sno_cluster_name or
sno_kubeconfig to be provided up front, or change the resolution logic to use a
source that does not depend on the default value. Keep the fix localized to the
sno_kubeconfig_resolved expression and related variable handling.
---
Outside diff comments:
In `@deploy/Makefile`:
- Around line 122-130: Update the Makefile help output so the new `sno-ipi` and
`sno-agent` targets are listed alongside the existing deployment entries in
`make help`. Add matching `@echo` lines in the help block near the
`sno-to-3node` entry, using the same wording/style as the other target
descriptions so the new entrypoints are discoverable.
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml`:
- Around line 26-32: The missing-AGENT_E2E_TEST_SCENARIO failure message in the
config validation task should be updated to include the SNO option now supported
by the method-prefix logic. Adjust the fail message in the config validation
flow so it lists the correct remediation for all supported prefixes, including
the new sno case alongside the existing agent scenarios, and make sure the
guidance matches the prefix handling used in the task that selects by method.
In `@deploy/openshift-clusters/scripts/deploy-cluster.sh`:
- Around line 62-69: The validation in deploy-cluster.sh currently permits every
METHOD with TOPOLOGY=sno, which allows an unsupported sno + kcli path to
continue into the kcli install flow. Update the argument checks around the
TOPOLOGY/METHOD validation blocks to explicitly reject the unsupported
combination before any install logic runs, and keep the failure consistent with
the existing validation messages. Reference the validation section and the
kcli-install path so the guard is placed where the topology and method are first
checked.
---
Nitpick comments:
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-2: The clean-mutable-topology.sh wrapper uses the wrong Bash
shebang, so update the script header to match the repository standard. Replace
the current shebang at the top of the file with the repo-required Bash
interpreter line, keeping the rest of the script unchanged; this is the only fix
needed in the script’s startup header.
In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Around line 1-2: The wrapper script currently uses the non-standard Bash
shebang, so update the script header to match the repository convention by
changing the shebang at the top of sno-to-3node.sh to the repo-standard Bash
path. Keep the rest of the script unchanged, and verify the file still starts
with the expected shebang followed by set -euo pipefail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d1bb47f1-0e94-4c52-bf4c-ea4c9efe1020
📒 Files selected for processing (27)
deploy/Makefiledeploy/openshift-clusters/clean-mutable-topology.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/files/.gitignoredeploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.shdeploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.ymldeploy/openshift-clusters/scripts/clean-mutable-topology.shdeploy/openshift-clusters/scripts/deploy-cluster.shdeploy/openshift-clusters/scripts/sno-to-3node.shdeploy/openshift-clusters/sno-to-3node.yml
| tasks: | ||
| - name: Run mutable topology cleanup | ||
| import_role: | ||
| name: mutable-topology/sno-to-3node | ||
| tasks_from: clean.yml |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Refuse cleanup once the cluster is already HA unless the caller forces it.
This imports tasks/clean.yml unconditionally, and that role deletes master-1/master-2 plus their disks and DHCP state. On a successful 3-node cluster, removing two control-plane nodes here can drop etcd from 3 members to 1 and take the cluster down. Please gate this on the current control-plane topology, or require an explicit force_cleanup=true override before running the destructive role.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 35 - 39,
The cleanup playbook currently imports mutable-topology/sno-to-3node
tasks/clean.yml unconditionally, which can destroy a healthy HA control plane.
Add a topology check in clean-mutable-topology.yml before the import_role so it
only runs when the cluster is not already HA, or require an explicit
force_cleanup=true override to proceed. Use the existing import_role task as the
entry point and gate it with a condition or precheck that prevents destructive
cleanup unless the override is set.
| - name: "[preflight] Add api-int to libvirt network DNS" | ||
| # Using virsh net-update (not lineinfile) so that api-int survives addnhosts | ||
| # regeneration when create-vms.yml calls virsh net-update for master-1/2. | ||
| # libvirt regenerates ostestbm.addnhosts from the network XML on every net-update. | ||
| shell: | | ||
| # Remove any stale api-int entry (idempotent, ignore errors) | ||
| sudo virsh net-update {{ sno_libvirt_network }} delete dns-host \ | ||
| "<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \ | ||
| --live --config 2>/dev/null || true | ||
| # Add the correct entry - triggers addnhosts regeneration with api-int included | ||
| sudo virsh net-update {{ sno_libvirt_network }} add dns-host \ | ||
| "<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \ | ||
| --live --config | ||
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout | ||
| changed_when: true | ||
|
|
||
| - name: "[preflight] Verify api-int DNS after fix" | ||
| shell: | | ||
| dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1 | ||
| register: dns_apiint_verify | ||
| changed_when: false | ||
| failed_when: dns_apiint_verify.stdout | trim == "" | ||
| retries: 3 | ||
| delay: 2 | ||
| until: dns_apiint_verify.stdout | trim != "" | ||
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
The api-int repair path does not remove or verify stale mappings correctly.
The delete call matches api-int on {{ sno_master0_ip }}, so it only removes a record that already points at the desired IP. If api-int is stale and points somewhere else, the old XML entry survives, and the follow-up dig still passes because it only checks for any answer, not for {{ sno_master0_ip }}. Remove existing api-int entries by hostname/FQDN and verify the resolved answers include the expected IP.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`
around lines 129 - 154, The preflight repair flow for api-int only deletes
entries that already match sno_master0_ip and the verification step only checks
for any DNS answer, so stale mappings can remain undetected. Update the shell
logic in the api-int repair task to remove existing api-int entries by
hostname/FQDN regardless of IP, then make the dns_apiint_verify check confirm
the resolved result includes sno_master0_ip rather than just being non-empty.
Use the existing preflight tasks and variables like sno_libvirt_network,
sno_master0_ip, and sno_cluster_domain to locate and adjust the repair and
verification steps.
497dfcc to
134e627
Compare
There was a problem hiding this comment.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/Makefile (1)
135-144: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winAdd the new SNO deploy targets to
make help.
make helpdocuments the migration and cleanup entrypoints, butsno-ipiandsno-agentare still missing from the deployment list, so the new deploy flow is easy to miss.Suggested patch
`@echo` " fencing-assisted - Deploy hub + spoke TNF cluster via assisted installer" + `@echo` " sno-ipi - Deploy SNO IPI cluster (non-interactive)" + `@echo` " sno-agent - Deploy SNO Agent cluster (non-interactive)" `@echo` " sno-to-3node - Transition existing SNO cluster to 3-node HA (platform:none)"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/Makefile` around lines 135 - 144, The help output in Makefile is missing the new SNO deployment entrypoints, so update the `make help` target’s OpenShift Cluster Management section to include `sno-ipi` and `sno-agent` alongside the existing deploy/cleanup commands. Keep the new entries grouped with the other deployment targets and use the same `@echo` pattern in the `help` recipe so the new flow is discoverable.
🧹 Nitpick comments (3)
deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh (1)
1-41: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMake this file clearly a config fragment or a real script.
It currently has a Bash shebang, but it does not follow the repo’s shell-script contract (
#!/usr/bin/bashandset -euo pipefail). Either drop the shebang and document it as a sourced config example, or make it conform to the script standard. As per coding guidelines, "Shell scripts should use#!/usr/bin/bashshebang" and "Include error handling withset -euo pipefail".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh` around lines 1 - 41, The file is currently ambiguous between a sourced config fragment and an executable shell script, and its header does not match the repo shell-script contract. Either remove the shebang and clearly treat it as a sourced config example, or convert it into a proper script by updating the header in config_sno_example.sh to use the required shell standard and add strict error handling near the top. Make the choice consistent with how this file is used, and keep the existing export-based config values under that contract.Source: Coding guidelines
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)
1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the repository Bash shebang.
This wrapper already follows the repo's strict-mode pattern, but the shebang should be
#!/usr/bin/bashto match the shell-script standard.Suggested patch
-#!/bin/bash +#!/usr/bin/bashAs per coding guidelines, "Shell scripts should use
#!/usr/bin/bashshebang".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` around lines 1 - 2, The shell script header in sno-to-3node.sh should use the repository-standard Bash interpreter path instead of the generic Bash shebang. Update the script’s shebang at the top of the file to match the shell-script convention used across the repo, while keeping the existing strict-mode settings unchanged.Source: Coding guidelines
deploy/openshift-clusters/scripts/clean-mutable-topology.sh (1)
1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the repository Bash shebang.
This wrapper already follows the repo's strict-mode pattern, but the shebang should be
#!/usr/bin/bashto match the shell-script standard.Suggested patch
-#!/bin/bash +#!/usr/bin/bashAs per coding guidelines, "Shell scripts should use
#!/usr/bin/bashshebang".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 - 2, The script uses the wrong Bash shebang at the top of clean-mutable-topology.sh; update the shebang to the repository-standard /usr/bin/bash while keeping the existing strict-mode settings intact. Make this change at the script header so the wrapper matches the repo shell-script convention used alongside set -euo pipefail.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml`:
- Around line 15-19: The `oc project` retry block in `main.yml` is swallowing
failures because `failed_when: false` lets the handler succeed even after all
retries are exhausted. Update the handler logic around `oc_project_result` so
that a nonzero `rc` after the final retry actually fails the task, or at minimum
emits an explicit warning/error on the last attempt; keep the retry behavior
intact but make the terminal failure visible.
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml`:
- Line 43: The missing-scenario remediation for agent methods is incomplete
because the guidance still only mentions TNF_IPV4 and TNA_IPV4 even though sno
is now supported; update the failure message in the agent method validation
logic to include SNO and point users to SNO_IPV4 as well. Make the change in the
same place that checks AGENT_E2E_TEST_SCENARIO so the message stays aligned with
the scenario mapping used by sno.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`:
- Around line 57-58: The cleanup in boot-nodes is leaving UEFI guest definitions
behind because the `virsh undefine` call in the VM teardown sequence does not
remove the OVMF NVRAM varstore. Update the `virsh undefine` usage in the
`boot-nodes` task so it includes `--nvram` for these guests, keeping the
existing `virsh destroy`/cleanup flow in place and ensuring the old domain is
fully removed before the next `virt-install` runs.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 114-124: The cleanup in the `clean.yml` task only reloads libvirt
dnsmasq, but it leaves the NetworkManager dnsmasq override created by
`update-dns.yml` in place. Update the `[clean]` flow to also remove `{{
sno_nm_dnsmasq_conf }}` and then reload or restart `NetworkManager` so stale
local DNS entries for `api`, `api-int`, and `*.apps` are cleared. Use the
existing cleanup block near `[clean] Reload libvirt dnsmasq` and `[clean]
Cleanup complete` to add this DNS teardown step.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`:
- Around line 130-137: The "[create-vms] Add DNS entries to libvirt network"
task currently hides failures by setting failed_when: false on the virsh
net-update add dns-host call. Remove the failure masking so the task in
create-vms.yml fails when virsh net-update cannot update dns-host, and keep the
existing loop over sno_new_nodes so each hostname update is validated. If a
fallback is needed, handle it explicitly in a separate step rather than allowing
this task to report success when the libvirt DNS update did not apply.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.yml`:
- Around line 2-7: The `ignition.yml` probes are unbounded and can hang the role
during API or DNS stalls. Update the `"[ignition] Extract MCS CA certificate"`
task and the later `curl` probe task to use explicit timeouts for the `oc` and
`curl` commands, keeping the existing task structure and symbols like
`mcs_ca_raw` and the probe step intact. Ensure both commands fail quickly
instead of waiting indefinitely.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.yml`:
- Around line 42-64: The drain auto-fix currently applies to every cordoned node
in the mco-rollout task, which can bypass normal MCO drains or intentional
maintenance cordons. Update the logic around CORDONED_NODES so it only runs when
there is a stale MCO-drain signal, or change sno_auto_fix_drain to be opt-in by
default instead of default(true). Keep the uncordon/lastAppliedDrain/MCC restart
flow behind that stricter gate so normal SchedulingDisabled nodes are not
touched.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 135-143: The embedded Python passed to python3 -c in the preflight
task is not indented under the shell block, so the YAML parser can misread it as
new keys and fail to load the file. Fix the shell: | content in preflight.yml by
indenting the full Python snippet consistently beneath the CURRENT_IP assignment
while keeping the same logic in the task that queries virsh net-dumpxml for
api-int.{{ sno_cluster_domain }}.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.yml`:
- Around line 2-5: The CVO disable step in the topology transition flow is not
protected by a restore path, so a failure can leave the cluster without a
running CVO. Update the SNO-to-3node topology task sequence around the
scale-down step to use a block with an always section that scales
cluster-version-operator back up, or add an explicit restore task after the
rollout phases if CVO must remain down temporarily. Make sure the fix is tied to
the existing topology task(s) that perform the CVO scale-down so the recovery
runs even when later phases fail.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.yml`:
- Around line 57-69: The DNS verification in update-dns.yml is too weak because
the API check only tests for any non-empty result and the apps check never
fails, so phase 8 can pass with stale or partial records. Tighten the validation
in the existing dns verification tasks by comparing the dig results against
sno_all_node_ips[0] for api and sno_all_node_ips for test.apps, and make the
checks fail/retry until the returned records match instead of accepting any
response.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`:
- Around line 91-111: The final verification in the [verify] task only checks
the number of node rows and can pass even if some nodes are NotReady. Update the
existing node verification around verify_nodes so it explicitly asserts all
three nodes are Ready before the migration success message is emitted, and keep
the HA/etcd checks as-is. Use the verify_nodes assertion block and the "[verify]
SNO to 3-node transition verified successfully" debug step to locate where to
add the readiness condition.
In `@deploy/openshift-clusters/scripts/deploy-cluster.sh`:
- Around line 62-63: The topology validation in deploy-cluster.sh currently
allows sno unconditionally, which lets unsupported kcli SNO requests pass into
the playbook and fail later. Update the TOPOLOGY check in the deploy-cluster.sh
flow so sno is only accepted for supported deployment methods, using the
existing method selection logic in the script, or alternatively align
kcli-install/tasks/validate.yml so kcli explicitly supports sno. Keep the fix
centered around the topology validation gate and the kcli validation path.
---
Outside diff comments:
In `@deploy/Makefile`:
- Around line 135-144: The help output in Makefile is missing the new SNO
deployment entrypoints, so update the `make help` target’s OpenShift Cluster
Management section to include `sno-ipi` and `sno-agent` alongside the existing
deploy/cleanup commands. Keep the new entries grouped with the other deployment
targets and use the same `@echo` pattern in the `help` recipe so the new flow is
discoverable.
---
Nitpick comments:
In
`@deploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.sh`:
- Around line 1-41: The file is currently ambiguous between a sourced config
fragment and an executable shell script, and its header does not match the repo
shell-script contract. Either remove the shebang and clearly treat it as a
sourced config example, or convert it into a proper script by updating the
header in config_sno_example.sh to use the required shell standard and add
strict error handling near the top. Make the choice consistent with how this
file is used, and keep the existing export-based config values under that
contract.
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-2: The script uses the wrong Bash shebang at the top of
clean-mutable-topology.sh; update the shebang to the repository-standard
/usr/bin/bash while keeping the existing strict-mode settings intact. Make this
change at the script header so the wrapper matches the repo shell-script
convention used alongside set -euo pipefail.
In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Around line 1-2: The shell script header in sno-to-3node.sh should use the
repository-standard Bash interpreter path instead of the generic Bash shebang.
Update the script’s shebang at the top of the file to match the shell-script
convention used across the repo, while keeping the existing strict-mode settings
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 2df4a7c9-40e6-4c41-b37a-324e24187240
📒 Files selected for processing (27)
deploy/Makefiledeploy/openshift-clusters/clean-mutable-topology.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/files/.gitignoredeploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.shdeploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.ymldeploy/openshift-clusters/scripts/clean-mutable-topology.shdeploy/openshift-clusters/scripts/deploy-cluster.shdeploy/openshift-clusters/scripts/sno-to-3node.shdeploy/openshift-clusters/sno-to-3node.yml
| register: oc_project_result | ||
| retries: 5 | ||
| delay: 15 | ||
| until: oc_project_result.rc == 0 | ||
| failed_when: false |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Don’t swallow an exhausted oc project failure.
After all retries are used, failed_when: false still lets this handler pass, so a real kubeconfig/auth/API failure becomes silent and the role continues without ever selecting the target project. Fail when oc_project_result.rc != 0, or emit an explicit warning/error on the terminal retry.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml`
around lines 15 - 19, The `oc project` retry block in `main.yml` is swallowing
failures because `failed_when: false` lets the handler succeed even after all
retries are exhausted. Update the handler logic around `oc_project_result` so
that a nonzero `rc` after the final retry actually fails the task, or at minimum
emits an explicit warning/error on the last attempt; keep the retry behavior
intact but make the terminal failure visible.
| expected_prefix: | ||
| arbiter: "TNA" | ||
| fencing: "TNF" | ||
| sno: "SNO" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Update the missing-scenario remediation for sno too.
Adding sno here is incomplete: the earlier Fail if agent method is used without AGENT_E2E_TEST_SCENARIO message still only tells users to set TNF_IPV4 or TNA_IPV4. agent + sno will now fail with stale guidance instead of pointing to SNO_IPV4.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.yml` at
line 43, The missing-scenario remediation for agent methods is incomplete
because the guidance still only mentions TNF_IPV4 and TNA_IPV4 even though sno
is now supported; update the failure message in the agent method validation
logic to include SNO and point users to SNO_IPV4 as well. Make the change in the
same place that checks AGENT_E2E_TEST_SCENARIO so the message stays aligned with
the scenario mapping used by sno.
| sudo virsh destroy "$VM_NAME" 2>/dev/null || true | ||
| sudo virsh undefine "$VM_NAME" 2>/dev/null || true |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== boot-nodes.yml ==\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml | sed -n '1,140p'
printf '\n== create-vms.yml ==\n'
nl -ba deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml | sed -n '1,220p'Repository: openshift-eng/two-node-toolbox
Length of output: 230
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== libvirt semantics search ==\n'
python3 - <<'PY'
import urllib.request, urllib.parse, re
query = urllib.parse.quote('libvirt virsh undefine --nvram OVMF NVRAM semantics')
url = f'https://duckduckgo.com/html/?q={query}'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req, timeout=20) as resp:
html = resp.read().decode('utf-8', 'ignore')
for pat in [r'--nvram', r'undefine', r'NVRAM', r'OVMF']:
print(f'[{pat}]', 'found' if re.search(pat, html, re.I) else 'not found')
print(html[:2000])
PYRepository: openshift-eng/two-node-toolbox
Length of output: 2794
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== boot-nodes.yml ==\n'
sed -n '1,140p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml
printf '\n== create-vms.yml ==\n'
sed -n '1,220p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml
printf '\n== repo virsh undefine search ==\n'
grep -RIn --exclude-dir=.git -E 'virsh undefine|--nvram' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node | sed -n '1,120p'Repository: openshift-eng/two-node-toolbox
Length of output: 11126
🌐 Web query:
libvirt virsh undefine --nvram OVMF NVRAM documentation
💡 Result:
When using the virsh undefine command on a domain that utilizes UEFI/NVRAM, libvirt requires explicit instruction on how to handle the associated NVRAM file [1][2]. If a domain has an NVRAM file defined in its configuration and you attempt to undefine it without providing either the --nvram or --keep-nvram flag, the operation will fail with an error [1][3][4]. The relevant flags are: --nvram: Explicitly instructs libvirt to remove the associated NVRAM (varstore) file when the domain is undefined [5][2]. --keep-nvram: Explicitly instructs libvirt to retain the NVRAM file after the domain has been undefined [6][2]. This requirement was implemented to prevent accidental deletion of persistent NVRAM data (such as UEFI variables) [1][7]. If the domain is inactive, the undefine command removes its persistent configuration [8][2]. If the domain is active (running), it becomes transient—meaning it will continue to run until it is shut down, at which point it will be removed [8][2]. In both cases, the choice to delete or preserve the NVRAM file must be specified using these flags [6].
Citations:
- 1: https://bugzilla.redhat.com/show_bug.cgi?id=2019524
- 2: https://linuxcommandlibrary.com/man/virsh-undefine
- 3: https://bugzilla.redhat.com/show_bug.cgi?id=1751596
- 4: https://bugzilla.redhat.com/show_bug.cgi?id=1522706
- 5: https://www.libvirt.org/manpages/virsh.html
- 6: https://bugzilla.redhat.com/show_bug.cgi?id=2007659
- 7: https://patchew.org/Libvirt/20191015083122.21217-1-pmores@redhat.com/
- 8: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-virsh-delete
virsh undefine needs --nvram for these UEFI guests. virt-install defines each VM with an OVMF NVRAM varstore, so this cleanup can fail and leave the old domain behind; || true then hides the error until the next virt-install hits “domain already exists.”
Suggested fix
- sudo virsh undefine "$VM_NAME" 2>/dev/null || true
+ sudo virsh undefine "$VM_NAME" --nvram 2>/dev/null || true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sudo virsh destroy "$VM_NAME" 2>/dev/null || true | |
| sudo virsh undefine "$VM_NAME" 2>/dev/null || true | |
| sudo virsh destroy "$VM_NAME" 2>/dev/null || true | |
| sudo virsh undefine "$VM_NAME" --nvram 2>/dev/null || true |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`
around lines 57 - 58, The cleanup in boot-nodes is leaving UEFI guest
definitions behind because the `virsh undefine` call in the VM teardown sequence
does not remove the OVMF NVRAM varstore. Update the `virsh undefine` usage in
the `boot-nodes` task so it includes `--nvram` for these guests, keeping the
existing `virsh destroy`/cleanup flow in place and ensuring the old domain is
fully removed before the next `virt-install` runs.
| - name: "[clean] Reload libvirt dnsmasq" | ||
| shell: | | ||
| sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true | ||
| changed_when: true | ||
| failed_when: false | ||
|
|
||
| - name: "[clean] Cleanup complete" | ||
| debug: | ||
| msg: >- | ||
| Removed VMs: {{ sno_new_nodes | map(attribute='name') | list | join(', ') }}. | ||
| Disk images, NVRAM, ISOs, DHCP reservations, and DNS entries removed. |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Cleanup leaves the NetworkManager dnsmasq override behind.
update-dns.yml creates {{ sno_nm_dnsmasq_conf }} and restarts NetworkManager, but this cleanup only HUPs libvirt dnsmasq. After the VMs and libvirt DNS records are gone, the host can still resolve api, api-int, and *.apps from stale local overrides. Remove that file and reload/restart NetworkManager here as part of the DNS cleanup path.
Suggested cleanup addition
+- name: "[clean] Remove NM dnsmasq override"
+ file:
+ path: "{{ sno_nm_dnsmasq_conf }}"
+ state: absent
+ become: true
+ failed_when: false
+
+- name: "[clean] Restart NetworkManager to drop local DNS overrides"
+ systemd:
+ name: NetworkManager
+ state: restarted
+ become: true
+ failed_when: false
+
- name: "[clean] Reload libvirt dnsmasq"
shell: |
sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: "[clean] Reload libvirt dnsmasq" | |
| shell: | | |
| sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true | |
| changed_when: true | |
| failed_when: false | |
| - name: "[clean] Cleanup complete" | |
| debug: | |
| msg: >- | |
| Removed VMs: {{ sno_new_nodes | map(attribute='name') | list | join(', ') }}. | |
| Disk images, NVRAM, ISOs, DHCP reservations, and DNS entries removed. | |
| - name: "[clean] Remove NM dnsmasq override" | |
| file: | |
| path: "{{ sno_nm_dnsmasq_conf }}" | |
| state: absent | |
| become: true | |
| failed_when: false | |
| - name: "[clean] Restart NetworkManager to drop local DNS overrides" | |
| systemd: | |
| name: NetworkManager | |
| state: restarted | |
| become: true | |
| failed_when: false | |
| - name: "[clean] Reload libvirt dnsmasq" | |
| shell: | | |
| sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true | |
| changed_when: true | |
| failed_when: false | |
| - name: "[clean] Cleanup complete" | |
| debug: | |
| msg: >- | |
| Removed VMs: {{ sno_new_nodes | map(attribute='name') | list | join(', ') }}. | |
| Disk images, NVRAM, ISOs, DHCP reservations, and DNS entries removed. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`
around lines 114 - 124, The cleanup in the `clean.yml` task only reloads libvirt
dnsmasq, but it leaves the NetworkManager dnsmasq override created by
`update-dns.yml` in place. Update the `[clean]` flow to also remove `{{
sno_nm_dnsmasq_conf }}` and then reload or restart `NetworkManager` so stale
local DNS entries for `api`, `api-int`, and `*.apps` are cleared. Use the
existing cleanup block near `[clean] Reload libvirt dnsmasq` and `[clean]
Cleanup complete` to add this DNS teardown step.
| - name: "[create-vms] Add DNS entries to libvirt network" | ||
| shell: | | ||
| sudo virsh net-update {{ sno_libvirt_network }} add dns-host \ | ||
| "<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \ | ||
| --live --config | ||
| loop: "{{ sno_new_nodes }}" | ||
| failed_when: false | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Don't mask virsh net-update add dns-host failures.
This call is the durable source of truth for dnsmasq. If it fails here, the later lineinfile fallback only fixes the current addnhosts file, and a later net-update/network restart can drop the hostname while this phase still reports success.
Suggested fix
- name: "[create-vms] Add DNS entries to libvirt network"
shell: |
sudo virsh net-update {{ sno_libvirt_network }} add dns-host \
"<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \
--live --config
loop: "{{ sno_new_nodes }}"
- failed_when: false📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: "[create-vms] Add DNS entries to libvirt network" | |
| shell: | | |
| sudo virsh net-update {{ sno_libvirt_network }} add dns-host \ | |
| "<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \ | |
| --live --config | |
| loop: "{{ sno_new_nodes }}" | |
| failed_when: false | |
| - name: "[create-vms] Add DNS entries to libvirt network" | |
| shell: | | |
| sudo virsh net-update {{ sno_libvirt_network }} add dns-host \ | |
| "<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \ | |
| --live --config | |
| loop: "{{ sno_new_nodes }}" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`
around lines 130 - 137, The "[create-vms] Add DNS entries to libvirt network"
task currently hides failures by setting failed_when: false on the virsh
net-update add dns-host call. Remove the failure masking so the task in
create-vms.yml fails when virsh net-update cannot update dns-host, and keep the
existing loop over sno_new_nodes so each hostname update is validated. If a
fallback is needed, handle it explicitly in a separate step rather than allowing
this task to report success when the libvirt DNS update did not apply.
| - name: "[verify] Assert 3 nodes" | ||
| assert: | ||
| that: (verify_nodes.stdout_lines | length) >= 3 | ||
| fail_msg: "Expected 3+ nodes, found {{ verify_nodes.stdout_lines | length }}" | ||
|
|
||
| - name: "[verify] Assert HA topology" | ||
| assert: | ||
| that: verify_topo.stdout == "HighlyAvailable" | ||
| fail_msg: "Expected HighlyAvailable, got {{ verify_topo.stdout }}" | ||
|
|
||
| - name: "[verify] Assert 3 etcd members" | ||
| assert: | ||
| that: verify_etcd.stdout | trim | int >= 3 | ||
| fail_msg: "Expected 3+ etcd pods, found {{ verify_etcd.stdout | trim }}" | ||
|
|
||
| - name: "[verify] SNO to 3-node transition verified successfully" | ||
| debug: | ||
| msg: >- | ||
| MIGRATION COMPLETE: {{ verify_nodes.stdout_lines | length }} nodes, | ||
| {{ verify_etcd.stdout | trim }} etcd members, | ||
| topology={{ verify_topo.stdout }}, CVO={{ verify_cvo.stdout }} replicas. |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Final verification never proves the three nodes are actually Ready.
Line 93 only counts rows from oc get nodes --no-headers, so a cluster with NotReady control-plane nodes still reaches the success path at Lines 106-111. Assert readiness explicitly before declaring the migration complete.
Suggested fix
- name: "[verify] Assert 3 nodes"
assert:
- that: (verify_nodes.stdout_lines | length) >= 3
- fail_msg: "Expected 3+ nodes, found {{ verify_nodes.stdout_lines | length }}"
+ that:
+ - (verify_nodes.stdout_lines | length) >= 3
+ - (verify_nodes.stdout_lines | select('search', '\sReady(?:,|$)') | list | length) >= 3
+ fail_msg: "Expected 3 Ready nodes, got:\n{{ verify_nodes.stdout }}"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: "[verify] Assert 3 nodes" | |
| assert: | |
| that: (verify_nodes.stdout_lines | length) >= 3 | |
| fail_msg: "Expected 3+ nodes, found {{ verify_nodes.stdout_lines | length }}" | |
| - name: "[verify] Assert HA topology" | |
| assert: | |
| that: verify_topo.stdout == "HighlyAvailable" | |
| fail_msg: "Expected HighlyAvailable, got {{ verify_topo.stdout }}" | |
| - name: "[verify] Assert 3 etcd members" | |
| assert: | |
| that: verify_etcd.stdout | trim | int >= 3 | |
| fail_msg: "Expected 3+ etcd pods, found {{ verify_etcd.stdout | trim }}" | |
| - name: "[verify] SNO to 3-node transition verified successfully" | |
| debug: | |
| msg: >- | |
| MIGRATION COMPLETE: {{ verify_nodes.stdout_lines | length }} nodes, | |
| {{ verify_etcd.stdout | trim }} etcd members, | |
| topology={{ verify_topo.stdout }}, CVO={{ verify_cvo.stdout }} replicas. | |
| - name: "[verify] Assert 3 nodes" | |
| assert: | |
| that: | |
| - (verify_nodes.stdout_lines | length) >= 3 | |
| - (verify_nodes.stdout_lines | select('search', '\sReady(?:,|$)') | list | length) >= 3 | |
| fail_msg: "Expected 3 Ready nodes, got:\n{{ verify_nodes.stdout }}" | |
| - name: "[verify] Assert HA topology" | |
| assert: | |
| that: verify_topo.stdout == "HighlyAvailable" | |
| fail_msg: "Expected HighlyAvailable, got {{ verify_topo.stdout }}" | |
| - name: "[verify] Assert 3 etcd members" | |
| assert: | |
| that: verify_etcd.stdout | trim | int >= 3 | |
| fail_msg: "Expected 3+ etcd pods, found {{ verify_etcd.stdout | trim }}" | |
| - name: "[verify] SNO to 3-node transition verified successfully" | |
| debug: | |
| msg: >- | |
| MIGRATION COMPLETE: {{ verify_nodes.stdout_lines | length }} nodes, | |
| {{ verify_etcd.stdout | trim }} etcd members, | |
| topology={{ verify_topo.stdout }}, CVO={{ verify_cvo.stdout }} replicas. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`
around lines 91 - 111, The final verification in the [verify] task only checks
the number of node rows and can pass even if some nodes are NotReady. Update the
existing node verification around verify_nodes so it explicitly asserts all
three nodes are Ready before the migration success message is emitted, and keep
the HA/etcd checks as-is. Use the verify_nodes assertion block and the "[verify]
SNO to 3-node transition verified successfully" debug step to locate where to
add the readiness condition.
|
/label tide/merge-method-squash |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/openshift-clusters/scripts/deploy-cluster.sh (1)
52-63: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winKeep
--helpaligned with the accepted topologies.After allowing
snohere, the file header and the-h|--helpbranch still advertise onlyarbiter|fencing.deploy-cluster.sh --helpnow contradicts the actual interface.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/deploy-cluster.sh` around lines 52 - 63, Update the CLI documentation in deploy-cluster.sh so it matches the validation logic that now accepts sno. In the script header and the -h|--help handling, replace any arbiter|fencing-only wording with arbiter|fencing|sno, and ensure the usage/examples for topology reference the same accepted values as the TOPOLOGY validation block.
♻️ Duplicate comments (1)
deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml (1)
15-31: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winStill fail the handler after the last
oc projectretry.Line 19 keeps the task successful even when all retries are exhausted, and Lines 22-31 only log a warning. That still hides real kubeconfig/auth/API failures and lets the role continue in a bad state.
Proposed fix
- name: Run oc project command: oc --kubeconfig="{{ kubeconfig_path }}" project openshift-machine-api when: - kubeconfig_path is defined - kubeconfig_stat.stat.exists | default(false) changed_when: false register: oc_project_result retries: 5 delay: 15 until: oc_project_result.rc == 0 - failed_when: false + failed_when: oc_project_result.rc != 0 listen: Set OCP project - -- name: Warn if oc project failed after all retries - debug: - msg: >- - WARNING: 'oc project openshift-machine-api' still failing after all retries - (rc={{ oc_project_result.rc | default('n/a') }}). - stderr: {{ oc_project_result.stderr | default('') | trim }} - when: - - kubeconfig_path is defined - - kubeconfig_stat.stat.exists | default(false) - - oc_project_result.rc | default(0) != 0 - listen: Set OCP project🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml` around lines 15 - 31, The handler currently masks a failed oc_project_result by setting failed_when: false and only emitting a warning in the follow-up debug task, so the role can continue after all retries are exhausted. Update the Set OCP project handler logic to use oc_project_result in a failure condition after the retries are done, and keep the warning only as additional context; ensure the failure is tied to the existing oc_project_result check and the Set OCP project handler rather than just logging in the debug task.
🧹 Nitpick comments (2)
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)
1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the repository-standard Bash shebang.
This new script uses
#!/bin/bash, but the repo rule for*.shis#!/usr/bin/bash.As per coding guidelines,
**/*.sh:Shell scripts should use #!/usr/bin/bash shebang.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` around lines 1 - 2, The new shell script uses the wrong Bash shebang; update the top of sno-to-3node.sh to match the repository-standard `#!/usr/bin/bash` used for `*.sh` scripts. Keep the rest of the script unchanged and make sure the entrypoint shebang is consistent with the repo’s shell script convention.Source: Coding guidelines
deploy/openshift-clusters/scripts/clean-mutable-topology.sh (1)
1-2: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the repository-standard Bash shebang.
This new script uses
#!/bin/bash, but the repo rule for*.shis#!/usr/bin/bash.As per coding guidelines,
**/*.sh:Shell scripts should use #!/usr/bin/bash shebang.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 - 2, The new shell script is using the wrong Bash interpreter path; update the shebang in the script header to match the repository-standard format used for .sh files. Locate the script by its top-level shebang and replace the current Bash path with the repo-required /usr/bin/bash form so it aligns with the shell script convention.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deploy/Makefile`:
- Around line 79-84: The new deploy targets sno-ipi and sno-agent are missing
from the make help output, so update the help/documentation logic in
deploy/Makefile to advertise them alongside sno-to-3node. Add these targets to
the same help entry list or help-generation block used by make help so users can
discover the new entrypoints from the main CLI.
In `@deploy/openshift-clusters/clean-mutable-topology.yml`:
- Around line 35-53: The topology guard in the live cluster check is too
permissive because it allows cleanup to continue when the detected
controlPlaneTopology is Unknown. Update the logic around the live_topology check
and the Abort if cluster is HA and force_cleanup is not set step so destructive
cleanup only proceeds when the topology is explicitly SingleReplica, or when
force_cleanup is true; otherwise fail closed on Unknown or any non-explicit
result. Use the existing live_topology, clean.yml, and controlPlaneTopology
handling to make the condition explicit and prevent master-1/master-2 deletion
unless the topology is confirmed safe.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 127-130: The cleanup task still only reloads NetworkManager, so
the old dnsmasq state may survive and keep stale DNS overrides. Update the
cleanup step in the NetworkManager reset flow to use a full restart after
removing `{{ sno_nm_dnsmasq_conf }}`, matching the behavior used in
`update-dns.yml`; keep the change in the `clean.yml` task that manages
NetworkManager so the stale `api`/`api-int`/`*.apps` answers are fully cleared.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`:
- Around line 2-9: The stale VM cleanup in the create-vms task is missing the
OVMF-specific NVRAM removal flag, which can leave domain state behind and
interfere with recreating the VM. Update the `virsh undefine` call in the
`"[create-vms] Cleanup stale VMs"` task to include `--nvram`, keeping the rest
of the cleanup flow in `create-vms.yml` unchanged.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 155-167: The DNS verification in the preflight task is using
substring matching, which can incorrectly accept a different IP when one answer
prefixes another. Update the checks in the preflight shell task and the
`dns_apiint_verify` validation to compare against trimmed `stdout_lines` entries
instead of `in dns_apiint.stdout`, so the `dig` result for `api-int` must match
the exact `sno_master0_ip` value.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`:
- Around line 47-78: The verification step in the cluster operator summary only
reports degraded/unavailable operators and never blocks completion, so the
migration can still finish in a bad state. Update the verify task that runs the
oc get co summary and stores it in verify_co to explicitly fail when any
operator is Degraded or Available=False, using the existing degraded and
unavailable lists as the gate, so the role stops before emitting MIGRATION
COMPLETE.
---
Outside diff comments:
In `@deploy/openshift-clusters/scripts/deploy-cluster.sh`:
- Around line 52-63: Update the CLI documentation in deploy-cluster.sh so it
matches the validation logic that now accepts sno. In the script header and the
-h|--help handling, replace any arbiter|fencing-only wording with
arbiter|fencing|sno, and ensure the usage/examples for topology reference the
same accepted values as the TOPOLOGY validation block.
---
Duplicate comments:
In `@deploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.yml`:
- Around line 15-31: The handler currently masks a failed oc_project_result by
setting failed_when: false and only emitting a warning in the follow-up debug
task, so the role can continue after all retries are exhausted. Update the Set
OCP project handler logic to use oc_project_result in a failure condition after
the retries are done, and keep the warning only as additional context; ensure
the failure is tied to the existing oc_project_result check and the Set OCP
project handler rather than just logging in the debug task.
---
Nitpick comments:
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-2: The new shell script is using the wrong Bash interpreter
path; update the shebang in the script header to match the repository-standard
format used for .sh files. Locate the script by its top-level shebang and
replace the current Bash path with the repo-required /usr/bin/bash form so it
aligns with the shell script convention.
In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Around line 1-2: The new shell script uses the wrong Bash shebang; update the
top of sno-to-3node.sh to match the repository-standard `#!/usr/bin/bash` used
for `*.sh` scripts. Keep the rest of the script unchanged and make sure the
entrypoint shebang is consistent with the repo’s shell script convention.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: bd4f4653-4596-463f-be41-02bedea644dd
📒 Files selected for processing (27)
deploy/Makefiledeploy/openshift-clusters/clean-mutable-topology.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/files/.gitignoredeploy/openshift-clusters/roles/dev-scripts/install-dev/files/config_sno_example.shdeploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/vars/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.ymldeploy/openshift-clusters/scripts/clean-mutable-topology.shdeploy/openshift-clusters/scripts/deploy-cluster.shdeploy/openshift-clusters/scripts/sno-to-3node.shdeploy/openshift-clusters/sno-to-3node.yml
| sno-ipi: | ||
| @./openshift-clusters/scripts/deploy-cluster.sh --topology sno --method ipi | ||
|
|
||
| sno-agent: | ||
| @./openshift-clusters/scripts/deploy-cluster.sh --topology sno --method agent | ||
|
|
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Document the new SNO deploy targets in make help.
sno-ipi and sno-agent are added here, but the help output only advertises sno-to-3node. That makes the new entrypoints hard to discover from the main CLI.
Suggested help update
`@echo` " fencing-ipi - Deploy fencing IPI cluster (non-interactive)"
`@echo` " fencing-agent - Deploy fencing Agent cluster (non-interactive) (WIP Experimental)"
`@echo` " arbiter-ipi - Deploy arbiter IPI cluster (non-interactive)"
`@echo` " arbiter-agent - Deploy arbiter Agent cluster (non-interactive)"
+ `@echo` " sno-ipi - Deploy SNO IPI cluster (non-interactive)"
+ `@echo` " sno-agent - Deploy SNO Agent cluster (non-interactive)"
`@echo` " arbiter-kcli - Deploy arbiter cluster using kcli (non-interactive)"
`@echo` " fencing-kcli - Deploy fencing cluster using kcli (non-interactive)"
`@echo` " fencing-assisted - Deploy hub + spoke TNF cluster via assisted installer"
`@echo` " sno-to-3node - Transition existing SNO cluster to 3-node HA (platform:none)"Also applies to: 135-145
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/Makefile` around lines 79 - 84, The new deploy targets sno-ipi and
sno-agent are missing from the make help output, so update the
help/documentation logic in deploy/Makefile to advertise them alongside
sno-to-3node. Add these targets to the same help entry list or help-generation
block used by make help so users can discover the new entrypoints from the main
CLI.
| - name: Check live cluster topology | ||
| shell: | | ||
| KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \ | ||
| oc get infrastructure cluster \ | ||
| -o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown" | ||
| register: live_topology | ||
| changed_when: false | ||
| failed_when: false | ||
| ignore_errors: true | ||
|
|
||
| - name: Abort if cluster is HA and force_cleanup is not set | ||
| ansible.builtin.fail: | ||
| msg: >- | ||
| Cluster controlPlaneTopology is '{{ live_topology.stdout | trim }}'. | ||
| Refusing to remove master-1 and master-2 from a live HA cluster. | ||
| Re-run with -e "force_cleanup=true" to override. | ||
| when: | ||
| - live_topology.stdout | trim == "HighlyAvailable" | ||
| - not (force_cleanup | default(false) | bool) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Fail closed when topology detection is unavailable.
This still proceeds on Unknown. If oc/kubeconfig access is broken on an already-HA cluster, the playbook skips the guard and drops into clean.yml, which deletes master-1 and master-2 anyway. Destructive cleanup should only run when the topology is explicitly SingleReplica, or when force_cleanup=true is set.
Suggested guard change
- - name: Abort if cluster is HA and force_cleanup is not set
+ - name: Abort unless cluster is SingleReplica or force_cleanup is set
ansible.builtin.fail:
msg: >-
Cluster controlPlaneTopology is '{{ live_topology.stdout | trim }}'.
- Refusing to remove master-1 and master-2 from a live HA cluster.
+ Refusing to remove master-1 and master-2 unless the live cluster is
+ explicitly SingleReplica.
Re-run with -e "force_cleanup=true" to override.
when:
- - live_topology.stdout | trim == "HighlyAvailable"
- not (force_cleanup | default(false) | bool)
+ - live_topology.stdout | trim != "SingleReplica"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: Check live cluster topology | |
| shell: | | |
| KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \ | |
| oc get infrastructure cluster \ | |
| -o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown" | |
| register: live_topology | |
| changed_when: false | |
| failed_when: false | |
| ignore_errors: true | |
| - name: Abort if cluster is HA and force_cleanup is not set | |
| ansible.builtin.fail: | |
| msg: >- | |
| Cluster controlPlaneTopology is '{{ live_topology.stdout | trim }}'. | |
| Refusing to remove master-1 and master-2 from a live HA cluster. | |
| Re-run with -e "force_cleanup=true" to override. | |
| when: | |
| - live_topology.stdout | trim == "HighlyAvailable" | |
| - not (force_cleanup | default(false) | bool) | |
| - name: Abort unless cluster is SingleReplica or force_cleanup is set | |
| ansible.builtin.fail: | |
| msg: >- | |
| Cluster controlPlaneTopology is '{{ live_topology.stdout | trim }}'. | |
| Refusing to remove master-1 and master-2 unless the live cluster is | |
| explicitly SingleReplica. | |
| Re-run with -e "force_cleanup=true" to override. | |
| when: | |
| - not (force_cleanup | default(false) | bool) | |
| - live_topology.stdout | trim != "SingleReplica" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 35 - 53,
The topology guard in the live cluster check is too permissive because it allows
cleanup to continue when the detected controlPlaneTopology is Unknown. Update
the logic around the live_topology check and the Abort if cluster is HA and
force_cleanup is not set step so destructive cleanup only proceeds when the
topology is explicitly SingleReplica, or when force_cleanup is true; otherwise
fail closed on Unknown or any non-explicit result. Use the existing
live_topology, clean.yml, and controlPlaneTopology handling to make the
condition explicit and prevent master-1/master-2 deletion unless the topology is
confirmed safe.
| - name: "[clean] Reload NetworkManager to clear stale DNS overrides" | ||
| systemd: | ||
| name: NetworkManager | ||
| state: reloaded |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Restart NetworkManager here too.
update-dns.yml had to restart NetworkManager to apply {{ sno_nm_dnsmasq_conf }}, but cleanup only reloads after deleting that file. If the embedded dnsmasq process keeps the old config across reload, stale api/api-int/*.apps answers survive cleanup and poison the next run.
Proposed fix
- name: "[clean] Reload NetworkManager to clear stale DNS overrides"
systemd:
name: NetworkManager
- state: reloaded
+ state: restarted
become: true
failed_when: false📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: "[clean] Reload NetworkManager to clear stale DNS overrides" | |
| systemd: | |
| name: NetworkManager | |
| state: reloaded | |
| - name: "[clean] Reload NetworkManager to clear stale DNS overrides" | |
| systemd: | |
| name: NetworkManager | |
| state: restarted |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`
around lines 127 - 130, The cleanup task still only reloads NetworkManager, so
the old dnsmasq state may survive and keep stale DNS overrides. Update the
cleanup step in the NetworkManager reset flow to use a full restart after
removing `{{ sno_nm_dnsmasq_conf }}`, matching the behavior used in
`update-dns.yml`; keep the change in the `clean.yml` task that manages
NetworkManager so the stale `api`/`api-int`/`*.apps` answers are fully cleared.
| - name: "[create-vms] Cleanup stale VMs" | ||
| shell: | | ||
| sudo virsh destroy {{ item.name }} 2>/dev/null || true | ||
| sudo virsh undefine {{ item.name }} --remove-all-storage 2>/dev/null || true | ||
| sudo rm -f {{ sno_vm_image_dir }}/{{ item.name }}.qcow2 | ||
| loop: "{{ sno_new_nodes }}" | ||
| failed_when: false | ||
| changed_when: true |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/usr/bin/bash
set -euo pipefail
rg -n -C3 'virsh undefine|loader=.*/OVMF|--nvram' deploy/openshift-clusters/roles/mutable-topology/sno-to-3nodeRepository: openshift-eng/two-node-toolbox
Length of output: 4523
🌐 Web query:
libvirt virsh undefine --nvram OVMF UEFI domain behavior
💡 Result:
When using the virsh undefine command on a domain configured with UEFI (OVMF) firmware, the libvirt driver requires explicit instruction regarding the existing NVRAM variable store file [1][2]. If a domain has an associated NVRAM file and neither the --nvram nor the --keep-nvram flag is provided, the undefine operation will fail with an error such as "Requested operation is not valid: cannot undefine domain with nvram" [3][2][4]. The behavior of these flags is as follows: 1. --nvram: This flag instructs libvirt to delete the NVRAM variable store file (typically defined in the domain XML under /domain/os/nvram) upon undefining the domain [2][4]. This action is performed regardless of whether the domain is currently active or inactive, ensuring that the backing file is removed from the host's filesystem [5][2][6]. 2. --keep-nvram: This flag instructs libvirt to remove the domain's configuration from libvirt management while preserving the existing NVRAM file on the host's filesystem [2][4]. This mechanism was implemented to prevent the accidental loss of UEFI state (such as boot order, secure boot keys, or other NVRAM-stored variables) when a virtual machine is undefined, while also providing a safe mechanism for cleanup [1][2][7]. If you encounter difficulty deleting a UEFI-based VM, explicitly including --nvram in your command is the standard resolution [8][9].
Citations:
- 1: https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/message/EGHMTE36NCL52DM675EQJ2D5QJ6MIIZQ/
- 2: https://bugzilla.redhat.com/show_bug.cgi?id=1522706
- 3: https://bugzilla.redhat.com/show_bug.cgi?id=1751596
- 4: https://www.spinics.net/linux/fedora/libvir/msg257961.html
- 5: https://bugzilla.redhat.com/show_bug.cgi?id=2007659
- 6: https://bugzilla.redhat.com/show_bug.cgi?id=1467245
- 7: https://patchew.org/Libvirt/20191015083122.21217-1-pmores@redhat.com/
- 8: https://platform9.com/kb/pmo/solution/instance-booted-via-uefi-unable-to-be-deleted
- 9: Failed to delete a UEFI/OVMF Libvirt machine cockpit-project/cockpit#8217
🏁 Script executed:
sed -n '50,75p' deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.ymlRepository: openshift-eng/two-node-toolbox
Length of output: 1159
Add --nvram to the stale-VM undefine. virsh undefine needs it for OVMF guests, and without it the stale domain/NVRAM state can block the later VM create. deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml:5
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.yml`
around lines 2 - 9, The stale VM cleanup in the create-vms task is missing the
OVMF-specific NVRAM removal flag, which can leave domain state behind and
interfere with recreating the VM. Update the `virsh undefine` call in the
`"[create-vms] Cleanup stale VMs"` task to include `--nvram`, keeping the rest
of the cleanup flow in `create-vms.yml` unchanged.
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout | ||
| changed_when: true | ||
|
|
||
| - name: "[preflight] Verify api-int DNS after fix" | ||
| shell: | | ||
| dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1 | ||
| register: dns_apiint_verify | ||
| changed_when: false | ||
| failed_when: sno_master0_ip not in dns_apiint_verify.stdout | ||
| retries: 3 | ||
| delay: 2 | ||
| until: sno_master0_ip in dns_apiint_verify.stdout | ||
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Match DNS answers exactly, not by substring.
in dns_apiint.stdout can accept the wrong IP when one address is a prefix of another. Compare against trimmed stdout_lines instead.
Proposed fix
- when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout
+ when: dns_apiint.stdout | trim == "" or sno_master0_ip not in (dns_apiint.stdout_lines | map('trim') | list)
@@
- failed_when: sno_master0_ip not in dns_apiint_verify.stdout
+ failed_when: sno_master0_ip not in (dns_apiint_verify.stdout_lines | map('trim') | list)
@@
- until: sno_master0_ip in dns_apiint_verify.stdout
+ until: sno_master0_ip in (dns_apiint_verify.stdout_lines | map('trim') | list)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout | |
| changed_when: true | |
| - name: "[preflight] Verify api-int DNS after fix" | |
| shell: | | |
| dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1 | |
| register: dns_apiint_verify | |
| changed_when: false | |
| failed_when: sno_master0_ip not in dns_apiint_verify.stdout | |
| retries: 3 | |
| delay: 2 | |
| until: sno_master0_ip in dns_apiint_verify.stdout | |
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in dns_apiint.stdout | |
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in (dns_apiint.stdout_lines | map('trim') | list) | |
| changed_when: true | |
| - name: "[preflight] Verify api-int DNS after fix" | |
| shell: | | |
| dig +short api-int.{{ sno_cluster_domain }} `@192.168.111.1` | |
| register: dns_apiint_verify | |
| changed_when: false | |
| failed_when: sno_master0_ip not in (dns_apiint_verify.stdout_lines | map('trim') | list) | |
| retries: 3 | |
| delay: 2 | |
| until: sno_master0_ip in (dns_apiint_verify.stdout_lines | map('trim') | list) | |
| when: dns_apiint.stdout | trim == "" or sno_master0_ip not in (dns_apiint.stdout_lines | map('trim') | list) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`
around lines 155 - 167, The DNS verification in the preflight task is using
substring matching, which can incorrectly accept a different IP when one answer
prefixes another. Update the checks in the preflight shell task and the
`dns_apiint_verify` validation to compare against trimmed `stdout_lines` entries
instead of `in dns_apiint.stdout`, so the `dig` result for `api-int` must match
the exact `sno_master0_ip` value.
| - name: "[verify] Get cluster operator summary" | ||
| shell: | | ||
| oc get co -o json | python3 -c " | ||
| import json, sys | ||
| data = json.load(sys.stdin) | ||
| total = len(data['items']) | ||
| degraded = [] | ||
| unavailable = [] | ||
| for co in data['items']: | ||
| name = co['metadata']['name'] | ||
| conds = {c['type']: c['status'] for c in co.get('status', {}).get('conditions', [])} | ||
| if conds.get('Degraded') == 'True': | ||
| degraded.append(name) | ||
| if conds.get('Available') == 'False': | ||
| unavailable.append(name) | ||
| print(f'Total COs: {total}') | ||
| if degraded: | ||
| print(f'Degraded: {chr(44).join(degraded)}') | ||
| else: | ||
| print('Degraded: none') | ||
| if unavailable: | ||
| print(f'Unavailable: {chr(44).join(unavailable)}') | ||
| else: | ||
| print('Unavailable: none') | ||
| " | ||
| environment: "{{ sno_oc_env }}" | ||
| register: verify_co | ||
| changed_when: false | ||
|
|
||
| - name: "[verify] Cluster operators" | ||
| debug: | ||
| msg: "{{ verify_co.stdout }}" |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Fail the migration when cluster operators are still degraded or unavailable.
This block only prints the degraded/unavailable operator list, then the role still emits MIGRATION COMPLETE. Since mco-rollout.yml is best-effort and main.yml ends after this phase, there is no later hard gate that proves post-transition convergence.
Proposed fix
- name: "[verify] Cluster operators"
debug:
msg: "{{ verify_co.stdout }}"
+- name: "[verify] Assert cluster operators healthy"
+ assert:
+ that:
+ - "'Degraded: none' in verify_co.stdout"
+ - "'Unavailable: none' in verify_co.stdout"
+ fail_msg: "{{ verify_co.stdout }}"
+
- name: "[verify] Get MCP master status"
shell: |
oc get mcp master --no-headersAlso applies to: 111-116
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`
around lines 47 - 78, The verification step in the cluster operator summary only
reports degraded/unavailable operators and never blocks completion, so the
migration can still finish in a bad state. Update the verify task that runs the
oc get co summary and stores it in verify_co to explicitly fail when any
operator is Degraded or Available=False, using the existing degraded and
unavailable lists as the gate, so the role stops before emitting MIGRATION
COMPLETE.
Add a new Ansible role and make target to perform a one-shot migration of a platform:None SNO cluster to a 3-node HA cluster. The role creates two new control-plane VMs, boots them with RHCOS via coreos-installer ISO ignition embed, waits for nodes to join and etcd to scale, then transitions the topology to HighlyAvailable. Usage: make sno-to-3node Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ehila <ehila@redhat.com>
e1bf815 to
f8eef94
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml (1)
127-132: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winStill reloading instead of restarting NetworkManager.
update-dns.ymlrestarts NetworkManager to applysno_nm_dnsmasq_conf, but this cleanup step only reloads (state: reloaded) after deleting the file, which was previously flagged as insufficient to guarantee the embedded dnsmasq process drops the stale config.♻️ Proposed fix
- name: "[clean] Reload NetworkManager to clear stale DNS overrides" systemd: name: NetworkManager - state: reloaded + state: restarted become: true failed_when: false🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml` around lines 127 - 132, The cleanup task still uses a NetworkManager reload, which is not enough to ensure the embedded dnsmasq process drops the removed DNS override. Update the task named “[clean] Reload NetworkManager to clear stale DNS overrides” in clean.yml to perform a restart via the systemd task instead of a reload, matching the behavior used in update-dns.yml and keeping the existing become and failed_when handling.deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml (1)
47-90: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winCluster operator and MCP health are still not gated before declaring success.
The
[verify] Cluster operatorsand[verify] MCP statusblocks only printverify_co.stdout/verify_mcp.stdout; there is noassertthat fails the play when operators are Degraded/Unavailable or the MCP is not Updated.main.ymltreatsverify.ymlas the terminal "Phase 9" step, so the role can still report a successful migration with degraded operators or a stuck MCP rollout.♻️ Proposed fix
- name: "[verify] Cluster operators" debug: msg: "{{ verify_co.stdout }}" +- name: "[verify] Assert cluster operators healthy" + assert: + that: + - "'Degraded: none' in verify_co.stdout" + - "'Unavailable: none' in verify_co.stdout" + fail_msg: "{{ verify_co.stdout }}" + - name: "[verify] Get MCP master status" shell: | oc get mcp master --no-headers environment: "{{ sno_oc_env }}" register: verify_mcp changed_when: false - name: "[verify] MCP status" debug: msg: "{{ verify_mcp.stdout }}" + +- name: "[verify] Assert MCP updated" + assert: + that: "'True' in verify_mcp.stdout" + fail_msg: "MCP master not fully updated:\n{{ verify_mcp.stdout }}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml` around lines 47 - 90, The verify tasks currently only report Cluster Operator and MCP output, but do not fail the play when health is bad. Update the verify flow in the existing `verify_co`/`verify_mcp` checks so it asserts no operators are Degraded or Unavailable and that the master MCP is fully updated/ready before success is declared. Use the existing `verify_co` and `verify_mcp` registered results to add explicit failure conditions after the `"[verify] Cluster operators"` and `"[verify] MCP status"` steps.
🧹 Nitpick comments (3)
deploy/openshift-clusters/scripts/sno-to-3node.sh (1)
1-1: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueShebang should be
#!/usr/bin/bash.As per coding guidelines,
**/*.shfiles "should use#!/usr/bin/bashshebang". This script uses#!/bin/bash.diff
-#!/bin/bash +#!/usr/bin/bash🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/sno-to-3node.sh` at line 1, Update the script header to use the required bash interpreter path: this file currently starts with the wrong shebang, so replace the existing `#!/bin/bash` at the top of `sno-to-3node.sh` with `#!/usr/bin/bash` to match the shell script guideline.Source: Coding guidelines
deploy/openshift-clusters/scripts/clean-mutable-topology.sh (2)
1-22: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicate logic with
sno-to-3node.sh.This script is identical to
deploy/openshift-clusters/scripts/sno-to-3node.shaside from the playbook filename. Consider extracting the shared instance/inventory validation and invocation logic into a common helper (e.g., alongsidecommon.sh) to avoid divergence over time.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` around lines 1 - 22, The validation and ansible invocation logic in clean-mutable-topology.sh is duplicated in sno-to-3node.sh, so extract the shared checks and playbook runner into a common helper sourced by both scripts. Move the instance-dir and inventory.ini validation plus the cd/ansible-playbook flow into a reusable function in common.sh or a nearby shared script, and keep only the playbook-specific filename in each wrapper. Use the existing get_node_dir and DEPLOY_DIR setup to preserve behavior while removing the duplicate shell logic.
1-1: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueShebang should be
#!/usr/bin/bash.As per coding guidelines,
**/*.shfiles "should use#!/usr/bin/bashshebang". This script uses#!/bin/bash.diff
-#!/bin/bash +#!/usr/bin/bash🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh` at line 1, The script uses the wrong shebang and should follow the shell script guideline by switching the top-of-file interpreter line in clean-mutable-topology.sh from the current bash path to usr/bin/bash. Update the shebang only, keeping the rest of the script unchanged.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deploy/openshift-clusters/clean-mutable-topology.yml`:
- Around line 16-39: The cleanup probes are rebuilding KUBECONFIG from
dev_scripts_path and sno_cluster_name instead of honoring the sno_kubeconfig
override, which can point them at the wrong cluster. Update the two shell probes
in the cleanup workflow to use sno_kubeconfig when it is defined, and only fall
back to the existing computed kubeconfig path when the override is absent. Keep
the existing detection logic in the domain and topology steps, but make sure
both oc invocations reference the same kubeconfig source consistently.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`:
- Around line 122-124: The preflight DNS verification is hardcoded to query a
fixed dnsmasq server, which can point at the wrong libvirt network. Update the
`dig` checks in the preflight tasks to use the DNS server associated with `{{
sno_libvirt_network }}` instead of `@192.168.111.1`, and make the same change in
both the `api-int` and related verification steps so the check matches the
repair path.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2`:
- Around line 8-12: The `/etc/master.ign` entry in `auto-install.ign.j2` is too
permissive for rendered ignition content that may include sensitive
control-plane credentials. Update the file mode for this object to restrict
access to the installer service only, keeping the change localized to the
`/etc/master.ign` stanza and preserving the existing `contents.source` handling.
---
Duplicate comments:
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`:
- Around line 127-132: The cleanup task still uses a NetworkManager reload,
which is not enough to ensure the embedded dnsmasq process drops the removed DNS
override. Update the task named “[clean] Reload NetworkManager to clear stale
DNS overrides” in clean.yml to perform a restart via the systemd task instead of
a reload, matching the behavior used in update-dns.yml and keeping the existing
become and failed_when handling.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.yml`:
- Around line 47-90: The verify tasks currently only report Cluster Operator and
MCP output, but do not fail the play when health is bad. Update the verify flow
in the existing `verify_co`/`verify_mcp` checks so it asserts no operators are
Degraded or Unavailable and that the master MCP is fully updated/ready before
success is declared. Use the existing `verify_co` and `verify_mcp` registered
results to add explicit failure conditions after the `"[verify] Cluster
operators"` and `"[verify] MCP status"` steps.
---
Nitpick comments:
In `@deploy/openshift-clusters/scripts/clean-mutable-topology.sh`:
- Around line 1-22: The validation and ansible invocation logic in
clean-mutable-topology.sh is duplicated in sno-to-3node.sh, so extract the
shared checks and playbook runner into a common helper sourced by both scripts.
Move the instance-dir and inventory.ini validation plus the cd/ansible-playbook
flow into a reusable function in common.sh or a nearby shared script, and keep
only the playbook-specific filename in each wrapper. Use the existing
get_node_dir and DEPLOY_DIR setup to preserve behavior while removing the
duplicate shell logic.
- Line 1: The script uses the wrong shebang and should follow the shell script
guideline by switching the top-of-file interpreter line in
clean-mutable-topology.sh from the current bash path to usr/bin/bash. Update the
shebang only, keeping the rest of the script unchanged.
In `@deploy/openshift-clusters/scripts/sno-to-3node.sh`:
- Line 1: Update the script header to use the required bash interpreter path:
this file currently starts with the wrong shebang, so replace the existing
`#!/bin/bash` at the top of `sno-to-3node.sh` with `#!/usr/bin/bash` to match
the shell script guideline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 5541d134-f639-4040-b324-49e50d05bf1c
📒 Files selected for processing (23)
deploy/Makefiledeploy/openshift-clusters/clean-mutable-topology.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/handlers/main.ymldeploy/openshift-clusters/roles/dev-scripts/install-dev/tasks/config.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/defaults/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/create-vms.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/ignition.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/main.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/mco-rollout.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/topology.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/update-dns.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/verify.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-etcd.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/wait-nodes.ymldeploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/master.ign.j2deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/vars/main.ymldeploy/openshift-clusters/scripts/clean-mutable-topology.shdeploy/openshift-clusters/scripts/sno-to-3node.shdeploy/openshift-clusters/sno-to-3node.yml
| - name: Detect cluster domain from kubeconfig (best-effort) | ||
| shell: | | ||
| KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \ | ||
| oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \ | ||
| | sed 's|^https://api-int\.||; s|:6443$||' | ||
| register: detected_domain | ||
| changed_when: false | ||
| failed_when: false | ||
| ignore_errors: true | ||
|
|
||
| - name: Set cluster domain (use detected, variable override, or default) | ||
| set_fact: | ||
| sno_cluster_domain: >- | ||
| {{ sno_cluster_domain | ||
| if (sno_cluster_domain is defined and sno_cluster_domain) | ||
| else (detected_domain.stdout | trim | ||
| if (detected_domain.stdout is defined and detected_domain.stdout | trim) | ||
| else 'ostest.test.metalkube.org') }} | ||
|
|
||
| - name: Check live cluster topology | ||
| shell: | | ||
| KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \ | ||
| oc get infrastructure cluster \ | ||
| -o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown" |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Honor sno_kubeconfig in the cleanup probes.
Both oc probes rebuild KUBECONFIG from dev_scripts_path/sno_cluster_name, but this role now exposes sno_kubeconfig as the override for non-standard installs. When that override is used, domain detection falls back and topology detection degrades to Unknown, so this cleanup path can make decisions against the wrong cluster state.
Suggested fix
+ - name: Resolve kubeconfig path for mutable-topology checks
+ ansible.builtin.set_fact:
+ effective_sno_kubeconfig: >-
+ {{ sno_kubeconfig
+ if (sno_kubeconfig is defined and sno_kubeconfig)
+ else (dev_scripts_path | default('openshift-metal3/dev-scripts')) ~ '/ocp/' ~ (sno_cluster_name | default('ostest')) ~ '/auth/kubeconfig' }}
+
- name: Detect cluster domain from kubeconfig (best-effort)
shell: |
- KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
+ KUBECONFIG={{ effective_sno_kubeconfig | quote }} \
oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \
| sed 's|^https://api-int\.||; s|:6443$||'
@@
- name: Check live cluster topology
shell: |
- KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
+ KUBECONFIG={{ effective_sno_kubeconfig | quote }} \
oc get infrastructure cluster \
-o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: Detect cluster domain from kubeconfig (best-effort) | |
| shell: | | |
| KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \ | |
| oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \ | |
| | sed 's|^https://api-int\.||; s|:6443$||' | |
| register: detected_domain | |
| changed_when: false | |
| failed_when: false | |
| ignore_errors: true | |
| - name: Set cluster domain (use detected, variable override, or default) | |
| set_fact: | |
| sno_cluster_domain: >- | |
| {{ sno_cluster_domain | |
| if (sno_cluster_domain is defined and sno_cluster_domain) | |
| else (detected_domain.stdout | trim | |
| if (detected_domain.stdout is defined and detected_domain.stdout | trim) | |
| else 'ostest.test.metalkube.org') }} | |
| - name: Check live cluster topology | |
| shell: | | |
| KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \ | |
| oc get infrastructure cluster \ | |
| -o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown" | |
| - name: Resolve kubeconfig path for mutable-topology checks | |
| ansible.builtin.set_fact: | |
| effective_sno_kubeconfig: >- | |
| {{ sno_kubeconfig | |
| if (sno_kubeconfig is defined and sno_kubeconfig) | |
| else (dev_scripts_path | default('openshift-metal3/dev-scripts')) ~ '/ocp/' ~ (sno_cluster_name | default('ostest')) ~ '/auth/kubeconfig' }} | |
| - name: Detect cluster domain from kubeconfig (best-effort) | |
| shell: | | |
| KUBECONFIG={{ effective_sno_kubeconfig | quote }} \ | |
| oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \ | |
| | sed 's|^https://api-int\.||; s|:6443$||' | |
| register: detected_domain | |
| changed_when: false | |
| failed_when: false | |
| ignore_errors: true | |
| - name: Set cluster domain (use detected, variable override, or default) | |
| set_fact: | |
| sno_cluster_domain: >- | |
| {{ sno_cluster_domain | |
| if (sno_cluster_domain is defined and sno_cluster_domain) | |
| else (detected_domain.stdout | trim | |
| if (detected_domain.stdout is defined and detected_domain.stdout | trim) | |
| else 'ostest.test.metalkube.org') }} | |
| - name: Check live cluster topology | |
| shell: | | |
| KUBECONFIG={{ effective_sno_kubeconfig | quote }} \ | |
| oc get infrastructure cluster \ | |
| -o jsonpath='{.status.controlPlaneTopology}' 2>/dev/null || echo "Unknown" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 16 - 39,
The cleanup probes are rebuilding KUBECONFIG from dev_scripts_path and
sno_cluster_name instead of honoring the sno_kubeconfig override, which can
point them at the wrong cluster. Update the two shell probes in the cleanup
workflow to use sno_kubeconfig when it is defined, and only fall back to the
existing computed kubeconfig path when the override is absent. Keep the existing
detection logic in the domain and topology steps, but make sure both oc
invocations reference the same kubeconfig source consistently.
| - name: "[preflight] Verify api-int DNS on hypervisor" | ||
| shell: | | ||
| dig +short api-int.{{ sno_cluster_domain }} @192.168.111.1 |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Stop querying a fixed DNS server here.
Both dig checks hardcode @192.168.111.1, but the repair path updates {{ sno_libvirt_network }}. On any non-default libvirt network, preflight can inspect the wrong dnsmasq instance and either repair unnecessarily or fail verification after a successful virsh net-update.
Also applies to: 158-160
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/preflight.yml`
around lines 122 - 124, The preflight DNS verification is hardcoded to query a
fixed dnsmasq server, which can point at the wrong libvirt network. Update the
`dig` checks in the preflight tasks to use the DNS server associated with `{{
sno_libvirt_network }}` instead of `@192.168.111.1`, and make the same change in
both the `api-int` and related verification steps so the check matches the
repair path.
| "path": "/etc/master.ign", | ||
| "mode": 420, | ||
| "contents": { | ||
| "source": "data:text/plain;charset=utf-8;base64,{{ sno_master_ign_b64 }}" | ||
| } |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win
Tighten /etc/master.ign permissions.
master.ign can carry the rendered control-plane ignition, including pull-secret and cluster credential material. 0644 makes it world-readable in the live ISO environment for no benefit; 0600 is enough for the installer service.
Suggested fix
- "mode": 420,
+ "mode": 384,As per coding guidelines, "Handle secrets and credentials securely."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "path": "/etc/master.ign", | |
| "mode": 420, | |
| "contents": { | |
| "source": "data:text/plain;charset=utf-8;base64,{{ sno_master_ign_b64 }}" | |
| } | |
| "path": "/etc/master.ign", | |
| "mode": 384, | |
| "contents": { | |
| "source": "data:text/plain;charset=utf-8;base64,{{ sno_master_ign_b64 }}" | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/templates/auto-install.ign.j2`
around lines 8 - 12, The `/etc/master.ign` entry in `auto-install.ign.j2` is too
permissive for rendered ignition content that may include sensitive
control-plane credentials. Update the file mode for this object to restrict
access to the installer service only, keeping the change localized to the
`/etc/master.ign` stanza and preserving the existing `contents.source` handling.
Source: Coding guidelines
f8eef94 to
c5d09b6
Compare
pull rhcos-live.iso from payload if not present install coreos-installer if not present Signed-off-by: ehila <ehila@redhat.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml (1)
80-81: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winKeep
virsh undefinealigned with UEFI/NVRAM guests.This is still missing
--nvram; with the OVMF varstore configured byvirt-install, undefine can fail and leave the old domain behind before the next create.Suggested fix
sudo virsh destroy "$VM_NAME" 2>/dev/null || true - sudo virsh undefine "$VM_NAME" 2>/dev/null || true + sudo virsh undefine "$VM_NAME" --nvram 2>/dev/null || true🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml` around lines 80 - 81, Update the cleanup in the boot-nodes task so `virsh undefine` handles UEFI/NVRAM guests created by `virt-install`; specifically, in the `boot-nodes.yml` teardown block where `sudo virsh destroy` and `sudo virsh undefine` run, add the NVRAM-aware option to the `virsh undefine` call so the old domain is fully removed before recreation.
🧹 Nitpick comments (1)
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml (1)
60-68: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPreserve idempotent task status for the installer check.
This task reports changed even when
coreos-installeris already installed. Addchanged_whenso reruns don’t show a false mutation.Suggested fix
- name: "[boot] Ensure coreos-installer is available" shell: | if ! command -v coreos-installer &>/dev/null; then echo "coreos-installer not found, installing via dnf..." sudo dnf install -y coreos-installer else echo "coreos-installer already installed: $(command -v coreos-installer)" fi + register: coreos_installer_check + changed_when: "'installing via dnf' in coreos_installer_check.stdout"As per coding guidelines, deployments should be idempotent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml` around lines 60 - 68, The boot-nodes.yml task that checks for coreos-installer is reporting changed even on no-op reruns. Update the shell task named "[boot] Ensure coreos-installer is available" to set changed_when so it only reports changed when the dnf install branch runs, and stays unchanged when command -v coreos-installer already succeeds. Make sure the task remains idempotent while preserving the current install behavior.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`:
- Around line 80-81: Update the cleanup in the boot-nodes task so `virsh
undefine` handles UEFI/NVRAM guests created by `virt-install`; specifically, in
the `boot-nodes.yml` teardown block where `sudo virsh destroy` and `sudo virsh
undefine` run, add the NVRAM-aware option to the `virsh undefine` call so the
old domain is fully removed before recreation.
---
Nitpick comments:
In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml`:
- Around line 60-68: The boot-nodes.yml task that checks for coreos-installer is
reporting changed even on no-op reruns. Update the shell task named "[boot]
Ensure coreos-installer is available" to set changed_when so it only reports
changed when the dnf install branch runs, and stays unchanged when command -v
coreos-installer already succeeds. Make sure the task remains idempotent while
preserving the current install behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 2c26fbf7-9252-4826-94ce-3ec2d575b695
📒 Files selected for processing (1)
deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/boot-nodes.yml
Summary by CodeRabbit
New Features
Bug Fixes