Skip to content

Fix must-gather hang: apply --timeout to log streaming phase#2297

Open
afcollins wants to merge 1 commit into
openshift:mainfrom
afcollins:prow-timeout
Open

Fix must-gather hang: apply --timeout to log streaming phase#2297
afcollins wants to merge 1 commit into
openshift:mainfrom
afcollins:prow-timeout

Conversation

@afcollins

@afcollins afcollins commented Jul 2, 2026

Copy link
Copy Markdown

getGatherContainerLogs had no timeout, blocking forever if gather scripts ran long. Wrap all gather phases in a shared deadline context.

Co-Authored-By: Claude Opus 4.6

Summary by CodeRabbit

  • Bug Fixes
    • Improved gather reliability by applying a single timeout across the full gather lifecycle, including container startup, log streaming, and completion checks.
    • Reduced noisy errors when gather log streaming ends due to timeout or cancellation.
    • Kept the gathered output download step running independently so it can complete after the gather process finishes.

getGatherContainerLogs had no timeout, blocking forever if gather
scripts ran long. Wrap all gather phases in a shared deadline context.

Co-Authored-By: Claude Opus 4.6
Signed-off-by: Andrew Collins <ancollin@redhat.com>
@openshift-ci openshift-ci Bot requested review from atiratree and ingvagabund July 2, 2026 23:13
@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: afcollins
Once this PR has been reviewed and has the lgtm label, please assign atiratree for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Walkthrough

The change scopes a context.WithTimeout around the must-gather container-start, log-streaming, and completion-wait steps within processNextWorkItem, and extends log-streaming error suppression to include context.DeadlineExceeded alongside context.Canceled. The download/copy step continues using the parent context.

Changes

Must-Gather Timeout Scoping

Layer / File(s) Summary
Scope gather lifecycle to a timeout context
pkg/cli/admin/mustgather/mustgather.go
processNextWorkItem now creates a gatherCtx via context.WithTimeout(ctx, o.Timeout) covering container-start polling, log streaming, and gather-completion waiting, with deferred cancellation; log-streaming errors now suppress both context.Canceled and context.DeadlineExceeded, while the download/copy step still uses the original parent context.

Estimated code review effort: 2 (Simple) | ~10 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: applying --timeout to must-gather log streaming to fix hangs.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR changes only pkg/cli/admin/mustgather/mustgather.go and adds no Ginkgo test titles, so there are no unstable test names to flag.
Test Structure And Quality ✅ Passed Only mustgather.go changed; no Ginkgo tests were added or modified, so the test-structure checks don’t apply.
Microshift Test Compatibility ✅ Passed Only pkg/cli/admin/mustgather/mustgather.go changed; no Ginkgo e2e tests or test files were added, so the MicroShift check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Only changed file is pkg/cli/admin/mustgather/mustgather.go (non-test code); no new Ginkgo tests or SNO assumptions were added.
Topology-Aware Scheduling Compatibility ✅ Passed Only timeout-context/log-streaming logic changed in mustgather.go; no pod affinity, selectors, replicas, PDBs, or topology constraints were modified.
Ote Binary Stdout Contract ✅ Passed PR only changes must-gather timeout/log handling; no process-level stdout writes were added, and existing prints are stderr/runtime-path only.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Only pkg/cli/admin/mustgather/mustgather.go changed; no new Ginkgo e2e tests or network/IP assumptions were added.
No-Weak-Crypto ✅ Passed Patch only adds a shared timeout context and log error handling in mustgather; no weak crypto, custom crypto, or secret comparisons were introduced.
Container-Privileges ✅ Passed Diff only changes timeout handling in mustgather.go; no K8s manifest edits or new privileged settings (privileged/hostPID/allowPrivilegeEscalation/SYS_ADMIN/root) were introduced.
No-Sensitive-Data-In-Logs ✅ Passed Diff only adds a shared timeout context and suppresses deadline log noise; no new logs emit secrets, PII, or hostnames.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/cli/admin/mustgather/mustgather.go (1)

834-854: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add/update unit test coverage for the new timeout scoping behavior.

This change alters lifecycle timing and error-suppression logic but no corresponding test changes are included. processNextWorkItem is exercised via MustGatherOptions, so a table-driven test simulating a slow/hanging log stream (context deadline expiring mid-stream) would validate that the command now returns promptly with "gather never finished" instead of hanging, and that DeadlineExceeded from log streaming doesn't spuriously log "gather logs unavailable."
As per coding guidelines, "All changes must include unit test additions/changes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/admin/mustgather/mustgather.go` around lines 834 - 854, Add or update
unit tests for the new timeout scoping in MustGatherOptions.processNextWorkItem.
Cover a slow or hanging getGatherContainerLogs path where gatherCtx expires
mid-stream, and assert the command returns promptly with the expected "gather
never finished" behavior instead of hanging. Also verify that
context.DeadlineExceeded or context.Canceled from getGatherContainerLogs does
not trigger the "gather logs unavailable" log path, while non-context errors
still do.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/cli/admin/mustgather/mustgather.go`:
- Around line 834-854: Add or update unit tests for the new timeout scoping in
MustGatherOptions.processNextWorkItem. Cover a slow or hanging
getGatherContainerLogs path where gatherCtx expires mid-stream, and assert the
command returns promptly with the expected "gather never finished" behavior
instead of hanging. Also verify that context.DeadlineExceeded or
context.Canceled from getGatherContainerLogs does not trigger the "gather logs
unavailable" log path, while non-context errors still do.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 79101a03-76c9-4e86-9a09-f4f771ba4057

📥 Commits

Reviewing files that changed from the base of the PR and between a7ad572 and 2e8f8a3.

📒 Files selected for processing (1)
  • pkg/cli/admin/mustgather/mustgather.go

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@afcollins: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


// wait for gather container to be running (gather is running)
if err := o.waitForGatherContainerRunning(ctx, pod); err != nil {
if err := o.waitForGatherContainerRunning(gatherCtx, pod); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use ctx here. No need to include this kind of waiting in the timeout IMO.

// wait for pod to be running (gather has completed)
log("waiting for gather to complete")
if err := o.waitForGatherToComplete(ctx, pod); err != nil {
if err := o.waitForGatherToComplete(gatherCtx, pod); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are passing a timeout context here, we don't need to use PollUntilContextTimeout and o.Timeout within this function, so it can be removed, I think. Not that it really matters, but better not to repeat the same thing on multiple places.

@tchap

tchap commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Also I am not sure about our policy, but I tend to remove Co-Authored-By: Claude Opus 4.6 from commits. In Kubernetes it's actually forbidden.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants