HYPERFLEET-1262 - fix: cap gRPC backoff and add retry for Maestro recovery#219
HYPERFLEET-1262 - fix: cap gRPC backoff and add retry for Maestro recovery#219kuudori wants to merge 2 commits into
Conversation
…overy - Cap gRPC reconnect backoff at 5s via ExtraDialOpts/ConnectParams (upstream sdk-go PR #230 merged) - Add bounded retry (3 attempts, exponential backoff) for transient gRPC errors (codes.Unavailable) in Create/Get/Patch ManifestWork - Log structured warning in AlwaysAck before discarding errors
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughSummary by CodeRabbit
Walkthrough
Estimated code review effort: 4 (Complex) | ~45 minutes 🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
Risk Score: 4 —
|
| Signal | Detail | Points |
|---|---|---|
| PR size | 340 lines (>200) | +1 |
| Sensitive paths | cmd/ | +2 |
| Test coverage | Missing tests for: cmd/adapter | +1 |
Computed by hyperfleet-risk-scorer
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@go.mod`:
- Line 3: The module Go version is out of sync with the repo’s pinned baseline
and toolchain pins. Update the go directive in go.mod to match the repo’s
established Go 1.25.x baseline, or if the higher version is intentional, make
the corresponding AGENTS.md and Bingo tool module pins consistent in the same
change. Keep the version aligned across go.mod, AGENTS.md, and the pinned tool
modules so the baseline remains reproducible.
In `@internal/executor/handler.go`:
- Around line 60-64: The warning path in handler.go currently logs result.Errors
verbatim in the result.Status == StatusFailed branch, which can leak sensitive
upstream payloads. Update the logging in the event handler failure path (around
the errCtx/result handling in the handler function) to stop attaching the full
errors map; instead log only safe identifiers like the phase/step and a
redacted, high-level failure message. Use the existing logger.WithLogFields and
log.Warn call sites to keep context without exposing raw error text.
In `@internal/maestroclient/operations.go`:
- Around line 38-55: The retryOnTransientGRPC helper in Client is swallowing the
last transient gRPC failure by returning false, nil from the
ExponentialBackoffWithContext callback, which causes callers to see a generic
timeout instead of the real error. Update retryOnTransientGRPC to keep track of
the most recent transient error from fn, continue retrying on transient errors
as today, and when backoff is exhausted return that stored error instead of nil
so the original codes.Unavailable is preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e22f577b-384e-4bf3-96f7-aa9feda0dbec
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum,!**/go.sum
📒 Files selected for processing (8)
cmd/adapter/main.gogo.modinternal/executor/executor_test.gointernal/executor/handler.gointernal/maestroclient/client.gointernal/maestroclient/operations.gointernal/maestroclient/operations_test.gotest/integration/executor/executor_integration_test.go
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift-hyperfleet/architecture(manual)openshift-hyperfleet/hyperfleet-api(manual)openshift-hyperfleet/hyperfleet-sentinel(manual)openshift-hyperfleet/hyperfleet-adapter(manual)openshift-hyperfleet/hyperfleet-broker(manual)
| module github.com/openshift-hyperfleet/hyperfleet-adapter | ||
|
|
||
| go 1.25.0 | ||
| go 1.26.0 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== go.mod =='
sed -n '1,12p' go.mod
echo
echo '== Go/toolchain pins across the repo =='
rg -n --hidden --glob '!**/vendor/**' \
'^(go [0-9]+\.[0-9]+(\.[0-9]+)?|toolchain go[0-9]+\.[0-9]+(\.[0-9]+)?)|golang:[0-9]+\.[0-9]+|GO_VERSION|go-version:' \
go.mod AGENTS.md .tekton Dockerfile .github Makefile .tool-versions 2>/dev/null || trueRepository: openshift-hyperfleet/hyperfleet-adapter
Length of output: 641
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== top-level files =='
git ls-files | sed -n '1,200p'
echo
echo '== any Go/toolchain version pins in tracked files =='
rg -n --hidden --glob '!**/vendor/**' \
'^(go [0-9]+\.[0-9]+(\.[0-9]+)?|toolchain go[0-9]+\.[0-9]+(\.[0-9]+)?|FROM .*golang:[0-9]+\.[0-9]+|GOVERSION=|GO_VERSION=|go-version:)' \
. || true
echo
echo '== go.mod and any go.work =='
sed -n '1,40p' go.mod
[ -f go.work ] && { echo; sed -n '1,40p' go.work; } || true
echo
echo '== CI / build files mentioning Go =='
rg -n --hidden --glob '!**/vendor/**' 'go 1\.[0-9]+|golang:[0-9]+\.[0-9]+|toolchain|GOVERSION|GO_VERSION' \
.github .tekton Makefile Dockerfile* .tool-versions configs docs 2>/dev/null || trueRepository: openshift-hyperfleet/hyperfleet-adapter
Length of output: 8218
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== architecture review guidance for Go/toolchain or CI pinning =='
rg -n --hidden --glob '!**/vendor/**' \
'Go 1\.[0-9]+|toolchain|CI|Konflux|Tekton|Go version|CWE-829|supply chain' \
hyperfleet/standards/code-review docs . || trueRepository: openshift-hyperfleet/architecture
Length of output: 50389
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== Dockerfile =='
sed -n '1,220p' Dockerfile
echo
echo '== Makefile version/toolchain references =='
rg -n 'go1\.[0-9]+|golang:[0-9]+\.[0-9]+|GOVERSION|GO_VERSION|toolchain|VERSION ?=' Makefile .bingo/Variables.mk .tekton/*.yaml 2>/dev/null || true
echo
echo '== .tekton files mentioning Go/toolchain =='
for f in .tekton/*.yaml; do
[ -f "$f" ] || continue
echo "--- $f ---"
rg -n 'go1\.[0-9]+|golang:[0-9]+\.[0-9]+|GOVERSION|GO_VERSION|toolchain|VERSION ?=' "$f" || true
done
echo
echo '== AGENTS.md baseline references =='
rg -n 'Go 1\.[0-9]+|toolchain|baseline|version' AGENTS.mdRepository: openshift-hyperfleet/hyperfleet-adapter
Length of output: 3227
Align go.mod with the repo’s pinned Go baseline (go.mod:3)
go 1.26.0 drifts from AGENTS.md (Go 1.25.0) and the pinned Bingo tool modules (.bingo/golangci-lint.mod, .bingo/goimports.mod are still on 1.25.0). That’s a reproducibility/supply-chain drift (CWE-829); keep the module on 1.25.x or land the matching toolchain/CI pin updates in the same PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@go.mod` at line 3, The module Go version is out of sync with the repo’s
pinned baseline and toolchain pins. Update the go directive in go.mod to match
the repo’s established Go 1.25.x baseline, or if the higher version is
intentional, make the corresponding AGENTS.md and Bingo tool module pins
consistent in the same change. Keep the version aligned across go.mod,
AGENTS.md, and the pinned tool modules so the baseline remains reproducible.
Source: Coding guidelines
- Bump go-toolset image to 1.26 in Dockerfile and test/Dockerfile.integration - Log only failed phase names in AlwaysAck instead of full error map - Preserve last transient gRPC error on retry exhaustion instead of returning generic wait timeout - Use wait.ExponentialBackoffWithContext for retry logic
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
Dockerfile (1)
3-3: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRepo docs still say Go 1.25.0 — update alongside the 1.26 bump.
The builder image now tracks Go 1.26.x, matching
go.mod'sgo 1.26.0, but the repo's AGENTS.md/README guidance still references "Go 1.25.0" as the toolchain. Update that doc to avoid confusing contributors about the supported version.As per coding guidelines: "Go 1.25.0 · Cobra CLI · Viper config · golangci-lint (bingo-managed) · Tekton CI (Konflux)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Dockerfile` at line 3, Update the contributor docs to match the new Go toolchain version: the Dockerfile now uses Go 1.26.x and go.mod already targets go 1.26.0, so revise the stale “Go 1.25.0” references in AGENTS.md/README to avoid conflicting guidance. Locate the version mention in the repo docs and change the toolchain wording so it consistently reflects Go 1.26.x across the project.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Dockerfile`:
- Line 3: The builder stage in Dockerfile still uses a mutable Red Hat
go-toolset tag, so update the FROM instruction to reference a fixed `@sha256`
digest instead of the tag. Use the existing builder image reference to locate it
and pin the exact image digest for reproducible, tamper-resistant builds,
matching the same digest-pinning approach used for other base images in the
Dockerfile.
---
Nitpick comments:
In `@Dockerfile`:
- Line 3: Update the contributor docs to match the new Go toolchain version: the
Dockerfile now uses Go 1.26.x and go.mod already targets go 1.26.0, so revise
the stale “Go 1.25.0” references in AGENTS.md/README to avoid conflicting
guidance. Locate the version mention in the repo docs and change the toolchain
wording so it consistently reflects Go 1.26.x across the project.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 17d8e77d-b81d-436f-84e3-ea39b7ffdaa6
📒 Files selected for processing (4)
Dockerfileinternal/executor/handler.gointernal/maestroclient/operations.gotest/Dockerfile.integration
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift-hyperfleet/architecture(manual)openshift-hyperfleet/hyperfleet-api(manual)openshift-hyperfleet/hyperfleet-sentinel(manual)openshift-hyperfleet/hyperfleet-adapter(manual)openshift-hyperfleet/hyperfleet-broker(manual)
✅ Files skipped from review due to trivial changes (1)
- test/Dockerfile.integration
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/executor/handler.go
- internal/maestroclient/operations.go
|
/test unit |
|
/test lint |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What
Fix ~95s gRPC recovery delay after Maestro restart. Three changes:
Cap gRPC reconnect backoff at 5s — configure
grpc.WithConnectParamsvia newExtraDialOptsfield on OCM SDK'sGRPCDialer(upstream PR merged). Reduces worst-case reconnect delay from 120s to 5s.Bounded retry for transient gRPC errors — retry
Create/Get/PatchManifestWork calls up to 3 times with exponential backoff (1s, 2s, 4s) oncodes.Unavailable. Usesk8s.io/apimachinery/pkg/util/wait.ExponentialBackoffWithContext. Retry wraps rawworkClientcalls inside each method (beforeapperrors.MaestroErrorwraps the error, preserving gRPC status codes).Log warnings in AlwaysAck — structured warning with event ID, type, and error details before discarding errors. Makes silent publish failures visible in adapter logs.
Why
Maestro restart causes gRPC connection to enter
TRANSIENT_FAILUREwith grpc-go defaultMaxDelay=120s. HTTP path recovers instantly (per-request), but gRPC publish path stays in backoff. The adapter silently drops every failed publish viaAlwaysAck, so operators see "no progress" instead of "Maestro publish failing". Breaks tier2-nightlymaestro_unavailabilityrecovery test (120s window).Testing
isTransientGRPCError(6 cases) andretryOnTransientGRPC(5 scenarios: success, retry-success, max-attempts, non-transient-fast-fail, context-cancellation)AlwaysAcktests updated for new logger parametermake test+make lintpassTest plan
make test— unit tests passmake lint— 0 issuesmake build— binary buildsmaestro_unavailabilitytest passes 5x consistently