Skip to content

HYPERFLEET-1262 - fix: cap gRPC backoff and add retry for Maestro recovery#219

Open
kuudori wants to merge 2 commits into
openshift-hyperfleet:mainfrom
kuudori:HYPERFLEET-1262-grpc-recovery-delay
Open

HYPERFLEET-1262 - fix: cap gRPC backoff and add retry for Maestro recovery#219
kuudori wants to merge 2 commits into
openshift-hyperfleet:mainfrom
kuudori:HYPERFLEET-1262-grpc-recovery-delay

Conversation

@kuudori

@kuudori kuudori commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What

Fix ~95s gRPC recovery delay after Maestro restart. Three changes:

  1. Cap gRPC reconnect backoff at 5s — configure grpc.WithConnectParams via new ExtraDialOpts field on OCM SDK's GRPCDialer (upstream PR merged). Reduces worst-case reconnect delay from 120s to 5s.

  2. Bounded retry for transient gRPC errors — retry Create/Get/Patch ManifestWork calls up to 3 times with exponential backoff (1s, 2s, 4s) on codes.Unavailable. Uses k8s.io/apimachinery/pkg/util/wait.ExponentialBackoffWithContext. Retry wraps raw workClient calls inside each method (before apperrors.MaestroError wraps the error, preserving gRPC status codes).

  3. Log warnings in AlwaysAck — structured warning with event ID, type, and error details before discarding errors. Makes silent publish failures visible in adapter logs.

Why

Maestro restart causes gRPC connection to enter TRANSIENT_FAILURE with grpc-go default MaxDelay=120s. HTTP path recovers instantly (per-request), but gRPC publish path stays in backoff. The adapter silently drops every failed publish via AlwaysAck, so operators see "no progress" instead of "Maestro publish failing". Breaks tier2-nightly maestro_unavailability recovery test (120s window).

Testing

  • Unit tests for isTransientGRPCError (6 cases) and retryOnTransientGRPC (5 scenarios: success, retry-success, max-attempts, non-transient-fast-fail, context-cancellation)
  • All existing AlwaysAck tests updated for new logger parameter
  • make test + make lint pass

Test plan

  • make test — unit tests pass
  • make lint — 0 issues
  • make build — binary builds
  • Verify tier2-nightly maestro_unavailability test passes 5x consistently

…overy

- Cap gRPC reconnect backoff at 5s via ExtraDialOpts/ConnectParams
  (upstream sdk-go PR #230 merged)
- Add bounded retry (3 attempts, exponential backoff) for transient
  gRPC errors (codes.Unavailable) in Create/Get/Patch ManifestWork
- Log structured warning in AlwaysAck before discarding errors
@openshift-ci openshift-ci Bot requested review from pnguyen44 and sherine-k June 30, 2026 14:46
@openshift-ci

openshift-ci Bot commented Jun 30, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tirthct for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of Maestro client operations by retrying transient Unavailable gRPC errors with exponential backoff and added a clearer connection backoff policy.
    • Enhanced handler acknowledgment behavior to consistently discard results (preventing retries) while logging richer event context for warnings.
    • Metrics recording now normalizes outcomes and safely recovers from panic during metrics capture.
  • Tests
    • Updated and expanded unit/integration tests to match the updated handler wrapping and retry behavior.
  • Chores
    • Updated Go version, dependency set, and build container images.

Walkthrough

AlwaysAck now requires a logger and logs discarded errors and failed results, while WithMetrics recovers from panics during metric recording. Maestro client calls for CreateManifestWork, GetManifestWork, and PatchManifestWork now retry transient codes.Unavailable gRPC failures with exponential backoff, and NewMaestroClient sets explicit gRPC connect backoff parameters. go.mod and the Dockerfiles were bumped to Go 1.26 and updated dependency/base-image versions.

Estimated code review effort: 4 (Complex) | ~45 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the main changes: capped gRPC backoff and added Maestro recovery retries.
Description check ✅ Passed The description is directly about the same gRPC recovery, retry, and AlwaysAck logging changes in the diff.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output ✅ Passed PASS: No non-test log statement contains token/password/credential/secret fields or interpolation; the only token match is a non-log error path. CWE-532 not triggered.
No Hardcoded Secrets ✅ Passed Touched files contain no hardcoded credentials, secret literals, embedded creds, or long base64 strings; only benign constants/test data. CWE-798/CWE-259 absent.
No Weak Cryptography ✅ Passed No crypto/md5, crypto/des, crypto/rc4, SHA1-for-security, ECB, custom crypto, or secret-compare misuse in touched files; only crypto/tls/x509 for TLS (CWE-327).
No Injection Vectors ✅ Passed No CWE-89/78/79/502 sinks in modified code; changed files contain no exec.Command, SQL, template.HTML, or yaml.Unmarshal on untrusted input.
No Privileged Containers ✅ Passed No new privileged flags were added; Dockerfiles only changed base images. Existing USER root is justified and the test image is CI-only.
No Pii Or Sensitive Data In Logs ✅ Passed No CWE-532/209 signal: added logs emit only event_id/type, failed_phases, and transient gRPC error strings; no raw bodies, PII, or creds found.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Comment @coderabbitai help to get the list of available commands.

@hyperfleet-ci-bot

hyperfleet-ci-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

Risk Score: 4 — risk/high

Signal Detail Points
PR size 340 lines (>200) +1
Sensitive paths cmd/ +2
Test coverage Missing tests for: cmd/adapter +1

Computed by hyperfleet-risk-scorer

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go.mod`:
- Line 3: The module Go version is out of sync with the repo’s pinned baseline
and toolchain pins. Update the go directive in go.mod to match the repo’s
established Go 1.25.x baseline, or if the higher version is intentional, make
the corresponding AGENTS.md and Bingo tool module pins consistent in the same
change. Keep the version aligned across go.mod, AGENTS.md, and the pinned tool
modules so the baseline remains reproducible.

In `@internal/executor/handler.go`:
- Around line 60-64: The warning path in handler.go currently logs result.Errors
verbatim in the result.Status == StatusFailed branch, which can leak sensitive
upstream payloads. Update the logging in the event handler failure path (around
the errCtx/result handling in the handler function) to stop attaching the full
errors map; instead log only safe identifiers like the phase/step and a
redacted, high-level failure message. Use the existing logger.WithLogFields and
log.Warn call sites to keep context without exposing raw error text.

In `@internal/maestroclient/operations.go`:
- Around line 38-55: The retryOnTransientGRPC helper in Client is swallowing the
last transient gRPC failure by returning false, nil from the
ExponentialBackoffWithContext callback, which causes callers to see a generic
timeout instead of the real error. Update retryOnTransientGRPC to keep track of
the most recent transient error from fn, continue retrying on transient errors
as today, and when backoff is exhausted return that stored error instead of nil
so the original codes.Unavailable is preserved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e22f577b-384e-4bf3-96f7-aa9feda0dbec

📥 Commits

Reviewing files that changed from the base of the PR and between 04dbbc3 and c72d11a.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (8)
  • cmd/adapter/main.go
  • go.mod
  • internal/executor/executor_test.go
  • internal/executor/handler.go
  • internal/maestroclient/client.go
  • internal/maestroclient/operations.go
  • internal/maestroclient/operations_test.go
  • test/integration/executor/executor_integration_test.go
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

Comment thread go.mod
module github.com/openshift-hyperfleet/hyperfleet-adapter

go 1.25.0
go 1.26.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== go.mod =='
sed -n '1,12p' go.mod

echo
echo '== Go/toolchain pins across the repo =='
rg -n --hidden --glob '!**/vendor/**' \
  '^(go [0-9]+\.[0-9]+(\.[0-9]+)?|toolchain go[0-9]+\.[0-9]+(\.[0-9]+)?)|golang:[0-9]+\.[0-9]+|GO_VERSION|go-version:' \
  go.mod AGENTS.md .tekton Dockerfile .github Makefile .tool-versions 2>/dev/null || true

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 641


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== top-level files =='
git ls-files | sed -n '1,200p'

echo
echo '== any Go/toolchain version pins in tracked files =='
rg -n --hidden --glob '!**/vendor/**' \
  '^(go [0-9]+\.[0-9]+(\.[0-9]+)?|toolchain go[0-9]+\.[0-9]+(\.[0-9]+)?|FROM .*golang:[0-9]+\.[0-9]+|GOVERSION=|GO_VERSION=|go-version:)' \
  . || true

echo
echo '== go.mod and any go.work =='
sed -n '1,40p' go.mod
[ -f go.work ] && { echo; sed -n '1,40p' go.work; } || true

echo
echo '== CI / build files mentioning Go =='
rg -n --hidden --glob '!**/vendor/**' 'go 1\.[0-9]+|golang:[0-9]+\.[0-9]+|toolchain|GOVERSION|GO_VERSION' \
  .github .tekton Makefile Dockerfile* .tool-versions configs docs 2>/dev/null || true

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 8218


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== architecture review guidance for Go/toolchain or CI pinning =='
rg -n --hidden --glob '!**/vendor/**' \
  'Go 1\.[0-9]+|toolchain|CI|Konflux|Tekton|Go version|CWE-829|supply chain' \
  hyperfleet/standards/code-review docs . || true

Repository: openshift-hyperfleet/architecture

Length of output: 50389


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '== Dockerfile =='
sed -n '1,220p' Dockerfile

echo
echo '== Makefile version/toolchain references =='
rg -n 'go1\.[0-9]+|golang:[0-9]+\.[0-9]+|GOVERSION|GO_VERSION|toolchain|VERSION ?=' Makefile .bingo/Variables.mk .tekton/*.yaml 2>/dev/null || true

echo
echo '== .tekton files mentioning Go/toolchain =='
for f in .tekton/*.yaml; do
  [ -f "$f" ] || continue
  echo "--- $f ---"
  rg -n 'go1\.[0-9]+|golang:[0-9]+\.[0-9]+|GOVERSION|GO_VERSION|toolchain|VERSION ?=' "$f" || true
done

echo
echo '== AGENTS.md baseline references =='
rg -n 'Go 1\.[0-9]+|toolchain|baseline|version' AGENTS.md

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 3227


Align go.mod with the repo’s pinned Go baseline (go.mod:3)
go 1.26.0 drifts from AGENTS.md (Go 1.25.0) and the pinned Bingo tool modules (.bingo/golangci-lint.mod, .bingo/goimports.mod are still on 1.25.0). That’s a reproducibility/supply-chain drift (CWE-829); keep the module on 1.25.x or land the matching toolchain/CI pin updates in the same PR.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go.mod` at line 3, The module Go version is out of sync with the repo’s
pinned baseline and toolchain pins. Update the go directive in go.mod to match
the repo’s established Go 1.25.x baseline, or if the higher version is
intentional, make the corresponding AGENTS.md and Bingo tool module pins
consistent in the same change. Keep the version aligned across go.mod,
AGENTS.md, and the pinned tool modules so the baseline remains reproducible.

Source: Coding guidelines

Comment thread internal/executor/handler.go
Comment thread internal/maestroclient/operations.go
- Bump go-toolset image to 1.26 in Dockerfile and test/Dockerfile.integration
- Log only failed phase names in AlwaysAck instead of full error map
- Preserve last transient gRPC error on retry exhaustion instead of
  returning generic wait timeout
- Use wait.ExponentialBackoffWithContext for retry logic

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
Dockerfile (1)

3-3: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Repo docs still say Go 1.25.0 — update alongside the 1.26 bump.

The builder image now tracks Go 1.26.x, matching go.mod's go 1.26.0, but the repo's AGENTS.md/README guidance still references "Go 1.25.0" as the toolchain. Update that doc to avoid confusing contributors about the supported version.

As per coding guidelines: "Go 1.25.0 · Cobra CLI · Viper config · golangci-lint (bingo-managed) · Tekton CI (Konflux)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile` at line 3, Update the contributor docs to match the new Go
toolchain version: the Dockerfile now uses Go 1.26.x and go.mod already targets
go 1.26.0, so revise the stale “Go 1.25.0” references in AGENTS.md/README to
avoid conflicting guidance. Locate the version mention in the repo docs and
change the toolchain wording so it consistently reflects Go 1.26.x across the
project.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Dockerfile`:
- Line 3: The builder stage in Dockerfile still uses a mutable Red Hat
go-toolset tag, so update the FROM instruction to reference a fixed `@sha256`
digest instead of the tag. Use the existing builder image reference to locate it
and pin the exact image digest for reproducible, tamper-resistant builds,
matching the same digest-pinning approach used for other base images in the
Dockerfile.

---

Nitpick comments:
In `@Dockerfile`:
- Line 3: Update the contributor docs to match the new Go toolchain version: the
Dockerfile now uses Go 1.26.x and go.mod already targets go 1.26.0, so revise
the stale “Go 1.25.0” references in AGENTS.md/README to avoid conflicting
guidance. Locate the version mention in the repo docs and change the toolchain
wording so it consistently reflects Go 1.26.x across the project.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 17d8e77d-b81d-436f-84e3-ea39b7ffdaa6

📥 Commits

Reviewing files that changed from the base of the PR and between c72d11a and e619cae.

📒 Files selected for processing (4)
  • Dockerfile
  • internal/executor/handler.go
  • internal/maestroclient/operations.go
  • test/Dockerfile.integration
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)
✅ Files skipped from review due to trivial changes (1)
  • test/Dockerfile.integration
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/executor/handler.go
  • internal/maestroclient/operations.go

Comment thread Dockerfile
@kuudori

kuudori commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test unit

@kuudori

kuudori commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test lint

@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant