singleArgoCD: createOrUpdateArgoCD unconditionally Update()s the singleton ArgoCD CR every reconcile, causing a self-sustaining fast reconcile loop and application-controller restarts #749

Description

@maximilianoPizarro

Summary

With global.singleArgoCD: true, internal/controller/pattern_controller.go's main Reconcile calls createOrUpdateArgoCD(...) (internal/controller/argo.go) unconditionally on every reconcile pass:

// We only update the clusterwide argo instance so we can define our own 'initcontainers' section
err = createOrUpdateArgoCD(r.dynamicClient, r.fullClient, getClusterWideArgoName(), clusterWideNS, patternsOperatorConfig)

createOrUpdateArgoCD (argo.go) always builds a fresh hardcoded desired spec via newArgoCD(...) and, when the object already exists, calls a plain client.Resource(gvr).Namespace(namespace).Update(...) on it -- with no diff check against the live object beforehand:

} else { // update it
    oldArgo, oldUnstructured, errGet := getArgoCDFunc(client, name, namespace)
    ...
    argo.SetResourceVersion(oldArgo.GetResourceVersion())
    ...
    _, err = client.Resource(gvr).Namespace(namespace).Update(context.TODO(), newArgo, metav1.UpdateOptions{})
}

Since the operator appears to also react to changes on the ArgoCD object it just wrote (directly or transitively, e.g. via the Application/Pattern watch chain), each Update triggers another reconcile, which issues another Update. The result is a self-sustaining loop that runs far faster than the documented steady-state interval, ReconcileLoopRequeueTime (180s).

Impact (observed live on OpenShift, patterns-operator v0.0.77, gitops-operator v1.21.0)

  • oc get argocd <vpArgoNamespace> -n <vpArgoNamespace> -o jsonpath='{.metadata.resourceVersion}' polled every ~10-15s shows the resourceVersion incrementing continuously with no other actor touching the object (confirmed via --show-managed-fields: the sole non-status managedFields entry for spec belongs to field-manager manager, operation: Update, matching controller-runtime's default field-manager name for a plain typed/dynamic client Update -- not ServerSideApply).
  • Confirmed the root cause is this operator specifically (not ACM policies, not the clustergroup chart, not any local pattern chart) by scaling patterns-operator-controller-manager to 0 replicas: the ArgoCD CR's resourceVersion immediately stopped changing and stayed stable for 90+ seconds; scaling back to 1 replica, the churn resumed within the next reconcile.
  • openshift-gitops-operator's own controller reacts to each of these Update()s by recomputing/reapplying the Deployments/StatefulSet it manages ("Updating StatefulSet ... updating volumes/container resources/container command/container env", "Updating Deployment 'vp-gitops-redis/repo-server/server/applicationset-controller' - updating volumes/...") roughly every 15-30s, which restarts the <name>-application-controller StatefulSet pod on that same cadence.
  • Practical effect: Argo CD Application sync operations that are Running when the controller pod restarts get interrupted and can remain stuck in operationState.phase: Running (observed for over an hour on one Application in our environment) instead of completing, and status.sync.status never settles to Synced for several Applications even though the git repo has no outstanding diff.
  • Separately (same root cause, different symptom): spec.rbac set by newArgoCD()'s hardcoded baseline (defaultPolicy: role:readonly, a 3-line policy granting only system:cluster-admins/cluster-admins/admin -> role:admin, scopes: [groups,email]) always wins over anything a pattern's own chart tries to set on the same CR (even via ServerSideApply=true), since this operator does a plain Update every reconcile. There is no values/Pattern-CR override hook for this baseline today (unlike clustergroup-chart's clusterGroup.argoCD.rbac, added in v0.9.50, which is a no-op for singleArgoCD since clustergroup-chart's own templates/plumbing/argocd.yaml renders nothing at all when global.singleArgoCD=true).

Suggested fix

Before calling Update() in createOrUpdateArgoCD, compute a semantic diff between oldArgo.Spec (or the relevant subset this function owns) and the freshly-built argo.Spec, and skip the API call entirely when they already match. This alone should stop the self-sustaining loop, since there would be nothing left to re-trigger the watch after the first successful convergence.
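
To make the proposed guard concrete, here is a minimal sketch and not the operator's actual code. It assumes both specs are available as generic maps (the shape an unstructured object's "spec" field has) and uses stdlib only. The names desiredIsSubset and needsUpdate are hypothetical, and the field values in main are illustrative. It checks that every field the operator sets is already present in the live spec with the same value, and it ignores extra live fields such as server-side defaults, because a strict equality check could keep reporting a diff once the API server has added defaults.

```go
package main

import (
	"fmt"
	"reflect"
)

// desiredIsSubset reports whether every field set in desired is already
// present in live with an equal value. Fields that exist only in live
// (for example server-side defaults) are ignored.
func desiredIsSubset(desired, live interface{}) bool {
	switch d := desired.(type) {
	case map[string]interface{}:
		l, ok := live.(map[string]interface{})
		if !ok {
			return false
		}
		for key, dv := range d {
			lv, found := l[key]
			if !found || !desiredIsSubset(dv, lv) {
				return false
			}
		}
		return true
	case []interface{}:
		// Lists are compared positionally and must have the same length.
		l, ok := live.([]interface{})
		if !ok || len(l) != len(d) {
			return false
		}
		for i := range d {
			if !desiredIsSubset(d[i], l[i]) {
				return false
			}
		}
		return true
	default:
		return reflect.DeepEqual(desired, live)
	}
}

// needsUpdate is a hypothetical guard: Update() would only be called
// when it returns true.
func needsUpdate(desiredSpec, liveSpec map[string]interface{}) bool {
	return !desiredIsSubset(desiredSpec, liveSpec)
}

func main() {
	// Illustrative values only; not the real output of newArgoCD().
	desired := map[string]interface{}{
		"rbac": map[string]interface{}{"defaultPolicy": "role:readonly"},
	}
	live := map[string]interface{}{
		"rbac":   map[string]interface{}{"defaultPolicy": "role:readonly"},
		"server": map[string]interface{}{"route": map[string]interface{}{"enabled": true}},
	}
	fmt.Println(needsUpdate(desired, live)) // false: already converged, skip Update()

	live["rbac"].(map[string]interface{})["defaultPolicy"] = "role:admin"
	fmt.Println(needsUpdate(desired, live)) // true: drift detected, Update() is warranted
}
```

If the comparison is done on typed structs instead, equality.Semantic.DeepEqual from k8s.io/apimachinery/pkg/api/equality may be a better fit than reflect.DeepEqual, but whether it tolerates the defaults added to this particular CR would need to be verified against a live object.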

As a secondary/independent improvement, consider exposing an override hook (mirroring clusterGroup.argoCD.rbac on the non-singleArgoCD path) for at least spec.rbac on the clusterwide instance, so patterns that need a local ArgoCD user with more than read-only access (e.g. an MCP-server-style integration) aren't stuck with the hardcoded 3-line baseline policy.
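
One possible shape for such a hook, sketched under stated assumptions: RBACSpec below is a hypothetical, trimmed-down stand-in for the CR's spec.rbac block, applyRBACOverride is a hypothetical helper, and the policy strings (including the mcp-users group) are illustrative rather than the operator's real baseline. Fields the override leaves unset fall back to the baseline.

```go
package main

import "fmt"

// RBACSpec is a hypothetical stand-in for the spec.rbac block of the
// ArgoCD CR; it is not the operator's real type.
type RBACSpec struct {
	DefaultPolicy *string
	Policy        *string
	Scopes        *string
}

func strPtr(s string) *string { return &s }

// applyRBACOverride returns the baseline with every non-nil field of
// override applied on top. A nil override leaves the baseline untouched.
func applyRBACOverride(baseline RBACSpec, override *RBACSpec) RBACSpec {
	if override == nil {
		return baseline
	}
	merged := baseline
	if override.DefaultPolicy != nil {
		merged.DefaultPolicy = override.DefaultPolicy
	}
	if override.Policy != nil {
		merged.Policy = override.Policy
	}
	if override.Scopes != nil {
		merged.Scopes = override.Scopes
	}
	return merged
}

func main() {
	// Illustrative baseline; the real one is hardcoded in newArgoCD().
	baseline := RBACSpec{
		DefaultPolicy: strPtr("role:readonly"),
		Policy:        strPtr("g, cluster-admins, role:admin"),
		Scopes:        strPtr("[groups,email]"),
	}
	// A pattern that needs one extra group with more than read-only access.
	override := &RBACSpec{
		Policy: strPtr("g, cluster-admins, role:admin\ng, mcp-users, role:admin"),
	}

	merged := applyRBACOverride(baseline, override)
	fmt.Println(*merged.DefaultPolicy) // unchanged baseline value
	fmt.Println(*merged.Policy)        // value taken from the override
}
```

Where the override would come from (Pattern CR, operator config, or chart values) is left open here. Combined with a diff check before Update(), an override applied this way would be written once and then left alone.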

Environment

  • patterns-operator.v0.0.77 (catalog channel fast, currently the latest available)
  • openshift-gitops-operator.v1.21.0
  • global.singleArgoCD: true, single-cluster hub-only reproduction (also affects hub+spoke topologies)
  • OpenShift 4.20
