Skip to content

Un-fork operator-cloud: bring cloud differences behind a runtime Cloud gate#4980

Open
Brian-McM wants to merge 3 commits into
tigera:masterfrom
Brian-McM:bm-upstream-cloud-differences
Open

Un-fork operator-cloud: bring cloud differences behind a runtime Cloud gate#4980
Brian-McM wants to merge 3 commits into
tigera:masterfrom
Brian-McM:bm-upstream-cloud-differences

Conversation

@Brian-McM

Copy link
Copy Markdown
Contributor

Description

Type of change: new feature (Calico Cloud support / un-fork of tigera/operator-cloud).

This brings the differences carried by the long-lived tigera/operator-cloud fork back into
tigera/operator, so the fork can be retired. All cloud behavior is gated behind the CALICO_CLOUD
environment variable
(cloud.EnableCloudEnvVar), set on the operator Deployment by the cloud
installer. When the gate is off (every Calico / Calico Enterprise install), behavior is unchanged —
the cloud code paths are inert.

Why merge this: it removes the maintenance burden of a separate fork that had to be continually
re-synced with this repo, and makes Calico Cloud a first-class, gated mode of the upstream operator.

How it works:

  • pkg/cloud parses CALICO_CLOUD at startup and returns Cloud=false (no-op) unless it is set,
    threading options.ControllerOptions.Cloud to controllers.
  • Controllers populate typed per-component cloud extension structs (e.g. render.ManagerCloudResources)
    only when opts.Cloud is true; render decorators are no-ops otherwise.
  • Cloud version pins are generated from config/cloud_versions.yml into pkg/components/cloud.go via a
    new standalone make gen-versions-cloud target (not part of the default gen-versions aggregate, so
    enterprise generated output is byte-identical).

Components affected (all gated): manager, logstorage (linseed / esgateway / kibana / elastic /
external-elastic / dashboards / kube-controllers + ES ILM), api-server RBAC, compliance, fluentd, tiers,
intrusion-detection, policy-recommendation, monitor, packet-capture. Intrusion-detection additionally
cleans up orphaned Image Assurance resources.

Testing:

  • make build / go build ./pkg/... ./cmd/... succeed.
  • Unit tests pass for every touched package, including new cloud-path tests (manager / fluentd / kibana /
    esgateway / linseed render and elastic / compliance / manager controllers), each with cloud-on and
    cloud-off (negative) cases confirming enterprise output is unchanged.
  • Branch rebased cleanly on master; cloud version generation is idempotent.

A full migration map and rationale is in docs/cloud-unfork-migration-plan.md.

Prerequisite context: tigera/operator-cloud#1059 (removal of dead Image Assurance / runtime security)
landed in the fork first.

Release Note

NONE

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files (no api/v1 / CRD changes in this PR)
  • If changing versions, run make gen-versions (make gen-versions-cloudpkg/components/cloud.go)

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

🤖 Generated with Claude Code

…d gate

Migrate the tigera/operator-cloud fork's differences into tigera/operator,
gated by the CALICO_CLOUD env var so enterprise/OSS behavior is unchanged
unless cloud mode is active.

- Foundation: pkg/cloud (CALICO_CLOUD gate, tolerant Load), render/common/cloudconfig,
  controller/utils/cloudconfig, options.{Cloud,ESMigration}, cmd/main wiring.
- Shared helpers: key_validator cloud tenancy claim (self-gating), auth_cloud,
  GetKeyValidatorConfig addTenancyClaim param, meta.DefaultCertificateDuration,
  elasticsearch.AddTenantId.
- Per-component cloud paths (all gated): manager, logstorage (linseed/esgateway/
  kibana/elastic/external-elastic/dashboards/kubecontrollers + ES ILM), compliance,
  fluentd, tiers, intrusiondetection, policyrecommendation, monitor, packetcapture.
- apiserver + kube-controllers RBAC divergences gated behind Cloud.
- Intrusion detection: clean up orphaned Image Assurance resources.
- Cloud version-gen wiring: -cloud-versions flag, cloud.go.tpl, gen-versions-cloud
  target, generated pkg/components/cloud.go.
- Cloud-path tests for manager/fluentd/kibana render and elastic/compliance/manager
  controllers. Plan + status in docs/cloud-unfork-migration-plan.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the Calico Cloud release tooling, gated by the `cloud` build tag so the
enterprise/OSS release tool is unaffected.

- hack/release/cloud.go + internal/versions/cloud.go (//go:build cloud): init()
  wraps the OSS release commands (GCR/tesla image defaults, cloud-vX.Y.Z version
  format, hashrelease support, CI output files). checks.go/flags.go are NOT changed
  (cloud.go reassigns isValidReleaseVersion at init).
- Makefile: `make release-cloud` / `release-publish-cloud` build with -tags cloud;
  a gated `ifeq ($(VARIANT),cloud)` block switches image identity to
  gcr.io/tigera-tesla/operator-cloud (amd64) without affecting enterprise builds.
- hack/release/CLOUD.md documents the flow.

Verified: untagged release tool builds + tests unchanged; tagged build + cloud
test pass; Makefile VARIANT gating switches vars correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread cmd/main.go
})
}

if cloudOpts.Cloud {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this real? I'm a bit surprised this would be needed for CC and not for everyone else who uses the operator 🤷

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably wouldn't hurt to use it for everybody.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would potentially slow down rolling update?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh right, ya it would. It seems like there's a better solution to this issue, how long is the apiserver taking that we need this sort of duration?

Comment thread cmd/main.go
K8sClientset: clientset,
MultiTenant: multiTenant,
ElasticExternal: discovery.UseExternalElastic(bootConfig),
ElasticExternal: discovery.UseExternalElastic(bootConfig) || cloudOpts.ElasticExternal,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UseExternalElastic is almost the same thing as ElasticExternal, I think - the main different possibly being the name of the config map that the EXTERNAL_ELASTIC var is set in?

Is there a way we can avoid having this multi-headed config knob?

Comment thread cmd/main.go
MultiTenant: multiTenant,
ElasticExternal: discovery.UseExternalElastic(bootConfig),
ElasticExternal: discovery.UseExternalElastic(bootConfig) || cloudOpts.ElasticExternal,
Cloud: cloudOpts.Cloud,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for now, I think longer term this should be folded into the ProductVariant

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe lets just do that now, I didn't actually think about that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into folding this into ProductVariant and I don't think it's the right move:

  • Variant is user-facing config on the Installation CR (enum-validated), but "this is a Calico Cloud management cluster" is an operator-deployment property that the cloud installer sets — not a product a user picks. They're different layers.
  • Cloud installs essentially all the enterprise components, so a CalicoCloud variant would have to be treated as enterprise everywhere (all ~76 IsEnterprise() call sites), and we'd still need a separate cloud flag for the cloud-specific modifications on top. So it doesn't actually remove the knob — it adds a variant on top of it.
  • Exact == CalicoEnterprise comparisons (e.g. the CRD watches in core_controller.go) would silently not match a new variant value, which is an easy source of subtle breakage.

So I'd keep it as the deployment-level CALICO_CLOUD flag: cloud runs as CalicoEnterprise + Cloud=true, which leaves the EE behavior intact and keeps the cloud tweaks orthogonal. Happy to revisit if cloud ever diverges from EE enough to justify a real variant.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Brian-McM yeah, definitely not for this PR - that's what I meant to say by "longer term"

But we could have a IsEnterprise and an IsCloud (and IsEnterprise would return true for both).

Happy to revisit if cloud ever diverges from EE enough to justify a real variant.

I don't think it's about divergence, it's about keeping our configuration model clean.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we could have a IsEnterprise and an IsCloud (and IsEnterprise would return true for both).

Ya that would be much nicer, I shouldn't have let Claude respond on behalf of me there lol.

I might just end up doing the refactor here, so I'm going to keep this open as a maybe todo.

Comment thread docs/cloud-unfork-migration-plan.md Outdated
Comment thread hack/gen-versions/main.go
}
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't comment on them, but we shouldn't be checking in binary files (.pyc) below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya that was certain a mistake.

Also for future reference you can comment on files, there's a little text box next to the "Viewed" check box on the right.

import requests
import ruamel.yaml

class ManagementClusterUpdater:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is the right home for this? Seems like this belongs in the cloud tooling repos somewhere instead of the operator?

Comment thread pkg/cloud/watch.go Outdated
Comment thread pkg/cloud/watch.go

// watch spawns a goroutine which should exit if a configmap's data is changed.
// it is stubbed for testing.
var watch = func(cs kubernetes.Interface, cmData map[string]string) error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change for this PR, but longer term seems like this should be handled by a controller rather than goroutine spawned as a side-effect of Load() (if we need this configmap long term at all)

KeyValidatorConfig: keyValidatorConfig,
KubernetesVersion: r.opts.KubernetesVersion,
ClusterDomain: r.opts.ClusterDomain,
Cloud: r.opts.Cloud,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already started doing this in other places - I'll handle it in my own refactoring but worth calling out that I'm moving to just having the ControllerOptions as a field on the controller instead of copying each var individually.

Comment thread pkg/controller/compliance/compliance_controller.go Outdated

// Calico Cloud: update the trusted bundle with system root certificates on management clusters,
// as they may be needed by compliance-server when the OIDC provider is external.
// TODO: we should do this closer to instantiation of the trustedBundle but can't because that code

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO fixable now that it's merged upstream?

reqLogger.Error(nil, "Kibana URL must be specified for this tenant")
d.status.SetDegraded(operatorv1.ResourceValidationError, "Kibana URL must be specified for this tenant", nil, reqLogger)
return reconcile.Result{}, nil
if d.multiTenant || !d.cloud {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. This feels a little bit wrong to me - might just come out in the wash if EE doesn't use external ES at all, but an EE cluster that is not multi-tenant will unconditionally error here?

If EE doesn't support this branch, then why do we need the !d.cloud clause?

Might just need an explainer comment.

if r.cloud {
// Read the cloud-kibana-config ConfigMap and parse it into a global map in the render package. The render code will read this map
// and use it to override the default Kibana configuration.
// TODO: We should instead be passing this via arguments to the render code.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we fix this TODO now that we're merged upstream?

}

if opts.Cloud {
// Calico Cloud addition.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment is redundant with the code here.

reqLogger.Error(nil, "Elasticsearch URL must be specified for this tenant")
r.status.SetDegraded(operatorv1.ResourceValidationError, "Elasticsearch URL must be specified for this tenant", nil, reqLogger)
return reconcile.Result{}, nil
if r.multiTenant || !r.cloud {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. common code that maybe could use a helper func to avoid desychronizing?

var mcr render.ManagerCloudResources
if r.opts.Cloud {
var reconcileResult *reconcile.Result
bundleMaker, mcr, tenant, reconcileResult, err = r.handleCloudReconcile(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fine, but I think we probably only structure the code this way because it used to be a fork and we didn't want to interleave merge conflicts - we should probably unpack this as a follow on to clean it up instead of overwriting / recalculating what was done above.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I fully intend on doing that right now, this is the initial pass.

Comment thread pkg/controller/options/options.go Outdated
// and instead will configure the cluster to use an external Elasticsearch.
ElasticExternal bool

// Cloud indicates the operator is running as a Calico Cloud install. When set, controllers

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Calico Cloud Install" -> I assume this actually means "Calico Cloud Management Cluster" ?

"sigs.k8s.io/controller-runtime/pkg/client"
)

// cloudPatchTier removes the allow-tigera tier's app.kubernetes.io/instance label to fix Calico

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow-tigera doesn't exist any more.

But also, can't we do this inline now instead of a separate patch step? i.e., an if statement in the render code?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of the code is structured the way it is due to the history of being a fork and wanting to minimize conflicts - but now that it's upstreamed that structuring adds complexity rather than removing it.


// GetCloudConfig retrieves the config map containing the configuration values needed to set up communications with
// external Elasticsearch and Kibana, such as the externalESDomain and externalKibanaDomain.
func GetCloudConfig(ctx context.Context, cli client.Client) (*cloudconfig.CloudConfig, error) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for this to be separate from the other utility functions in pkg/cloud/ ?

}
}
Expect(false).To(BeTrue(), fmt.Sprintf("Missing expected volume mount %s", name))
ExpectWithOffset(1, false).To(BeTrue(), fmt.Sprintf("Missing expected volume mount %s", name))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad we finally fixed this one upstream 👍


// Calico Cloud additions. TenantId is only set by the cloud-gated controller path; when empty
// (regular Calico/Calico Enterprise) no cloud env is emitted.
TenantId string

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TenantId string
TenantID string

return evs
}

func (e *esGateway) modifyDeploymentForCloud(d *appsv1.Deployment) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anywhere in cloud still use es-gateway?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still have single tenant clusters which I think still use this.

}

if e.cfg.Cloud.EnableMTLS {
// todo: delete these from the envVars

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can we address this TODO? Doesn't really say.

}
}

e.modifyDeploymentForCloud(&d)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say this pattern is wrong, but tbh it does look a fair bit like what I'm doing in #4871

So maybe it's actually best to keep it like this and eventually I rebase my PR and refactor lots of these to use the new extension mechanism instead.


// CloudKibanaConfigOverrides holds Calico Cloud Kibana config overrides. It is populated only by the
// cloud-gated controller path; for regular Calico/Calico Enterprise it stays empty and is a no-op.
// TODO: This shouldn't be done with a global variable set by the controller; thread it through Configuration instead.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hear, hear

Comment thread Makefile Outdated
# image. Enterprise/OSS builds (VARIANT unset) are completely unaffected. PUSH_MANIFEST_IMAGE_PREFIXES
# and PUSH_NONMANIFEST_IMAGE_PREFIXES above use recursive `=`, so they pick up these overrides.
ifeq ($(VARIANT),cloud)
REPO:=tigera/operator-cloud

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this might need to be tweaked?

Comment thread Makefile Outdated
sh -c '$(GIT_CONFIG_SSH) \
go build -buildvcs=false -o hack/bin/release ./hack/release'

# Calico Cloud release tooling. The cloud release tool is the same tool compiled with `-tags cloud`,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the build tags now? Would rather avoid those if possible.

CC @tigera/release-team / @radTuti to consider how we integrate the release tooling here.

Comment thread Makefile
# image to GCR (gcr.io/tigera-tesla/operator-cloud), amd64 only, instead of the enterprise quay
# image. Enterprise/OSS builds (VARIANT unset) are completely unaffected. PUSH_MANIFEST_IMAGE_PREFIXES
# and PUSH_NONMANIFEST_IMAGE_PREFIXES above use recursive `=`, so they pick up these overrides.
ifeq ($(VARIANT),cloud)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are setting VARIANT=cloud, do we need all the separate ?-cloud targets? (maybe gen-versions-cloud)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collapsed the separate -cloud targets into VARIANT=cloud on the existing release / release-publish / release-tag targets, so there's no release-cloud twin anymore. Kept gen-versions-cloud (as you suggested) since it generates a distinct file (pkg/components/cloud.go) rather than a variant of an existing target's output.

Comment thread Makefile Outdated
# Calico Cloud release tooling. The cloud release tool is the same tool compiled with `-tags cloud`,
# which activates hack/release/cloud.go (GCR/tesla image defaults, cloud-vX.Y.Z version format,
# hashrelease support). The regular `release`/`release-publish` targets above are unaffected.
hack/bin/release-cloud: $(shell find ./hack/release -type f)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't we should need a new binary build to be able to get a cloud release. The release binary should handle be able to handle both.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be that we use a new binary so that you can't accidentally modify an env variable and have the operator uninstall cloud components.

This is managing part of the cloud infrastructure, so even if it costs a little more to build and release I would much prefer the guarantee that the cloud flags are always set in the binary when running in cloud.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Landed here: one release binary, no build tags. cloud.go's cloud behavior now activates at runtime when VARIANT=cloud, so make release VARIANT=cloud / make release-tag VARIANT=cloud … drive the cloud release from the same hack/bin/release — your "the release binary should handle both" point. Removed the -tags cloud build and the release-cloud / release-publish-cloud / release-tag-cloud targets.

Brian's "flags must be guaranteed" concern is handled a layer down: VARIANT=cloud bakes cloud mode into the cloud operator image at build time (-ldflags -X …/pkg/cloud.buildVariant=cloudpkg/cloud.IsCloudBuild()), so a mutated Deployment env can't flip cloud off. Two images (enterprise + cloud), one release binary.

@Brian-McM Brian-McM force-pushed the bm-upstream-cloud-differences branch 2 times, most recently from d648b28 to 4daae70 Compare July 3, 2026 15:54
Bring the Calico Cloud CI from the fork, adapted to tigera/operator and made to
coexist with the enterprise CI.

Semaphore (additive — enterprise pipelines untouched):
- push_images_cloud.yml / release_cloud.yml: new pipelines that push the
  operator-cloud image to GCR via VARIANT=cloud (amd64 only); release triggers on
  cloud-v* tags.
- semaphore.yml: two additive promotions (Push Cloud Images, Release Cloud); the
  enterprise promotions and multi-arch build are unchanged.
- Makefile: release-tag-cloud target (uses the -tags cloud release tool).

ArgoCI (.argoci/, new to this repo): build-hashrelease + update-cluster-with-
hashrelease workflows and hack/hashrelease/*.py cluster updaters. Source-repo
references retargeted operator-cloud -> tigera/operator (clone URL/dir, labels,
branch param, build command -> make release-cloud release-publish-cloud); the
published image name (operator-cloud), GCR project, secrets, Slack and cluster
infra are kept as-is.

Fork-maintenance machinery (fork-sync crons/templates, update_fork.sh,
approve_check.yml) is intentionally dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Brian-McM Brian-McM force-pushed the bm-upstream-cloud-differences branch from 4daae70 to 625f1cb Compare July 3, 2026 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants