Un-fork operator-cloud: bring cloud differences behind a runtime Cloud gate#4980
Un-fork operator-cloud: bring cloud differences behind a runtime Cloud gate#4980Brian-McM wants to merge 3 commits into
Conversation
…d gate
Migrate the tigera/operator-cloud fork's differences into tigera/operator,
gated by the CALICO_CLOUD env var so enterprise/OSS behavior is unchanged
unless cloud mode is active.
- Foundation: pkg/cloud (CALICO_CLOUD gate, tolerant Load), render/common/cloudconfig,
controller/utils/cloudconfig, options.{Cloud,ESMigration}, cmd/main wiring.
- Shared helpers: key_validator cloud tenancy claim (self-gating), auth_cloud,
GetKeyValidatorConfig addTenancyClaim param, meta.DefaultCertificateDuration,
elasticsearch.AddTenantId.
- Per-component cloud paths (all gated): manager, logstorage (linseed/esgateway/
kibana/elastic/external-elastic/dashboards/kubecontrollers + ES ILM), compliance,
fluentd, tiers, intrusiondetection, policyrecommendation, monitor, packetcapture.
- apiserver + kube-controllers RBAC divergences gated behind Cloud.
- Intrusion detection: clean up orphaned Image Assurance resources.
- Cloud version-gen wiring: -cloud-versions flag, cloud.go.tpl, gen-versions-cloud
target, generated pkg/components/cloud.go.
- Cloud-path tests for manager/fluentd/kibana render and elastic/compliance/manager
controllers. Plan + status in docs/cloud-unfork-migration-plan.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the Calico Cloud release tooling, gated by the `cloud` build tag so the enterprise/OSS release tool is unaffected. - hack/release/cloud.go + internal/versions/cloud.go (//go:build cloud): init() wraps the OSS release commands (GCR/tesla image defaults, cloud-vX.Y.Z version format, hashrelease support, CI output files). checks.go/flags.go are NOT changed (cloud.go reassigns isValidReleaseVersion at init). - Makefile: `make release-cloud` / `release-publish-cloud` build with -tags cloud; a gated `ifeq ($(VARIANT),cloud)` block switches image identity to gcr.io/tigera-tesla/operator-cloud (amd64) without affecting enterprise builds. - hack/release/CLOUD.md documents the flow. Verified: untagged release tool builds + tests unchanged; tagged build + cloud test pass; Makefile VARIANT gating switches vars correctly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| }) | ||
| } | ||
|
|
||
| if cloudOpts.Cloud { |
There was a problem hiding this comment.
Is this real? I'm a bit surprised this would be needed for CC and not for everyone else who uses the operator 🤷
There was a problem hiding this comment.
It probably wouldn't hurt to use it for everybody.
There was a problem hiding this comment.
I think it would potentially slow down rolling update?
There was a problem hiding this comment.
Ohh right, ya it would. It seems like there's a better solution to this issue, how long is the apiserver taking that we need this sort of duration?
| K8sClientset: clientset, | ||
| MultiTenant: multiTenant, | ||
| ElasticExternal: discovery.UseExternalElastic(bootConfig), | ||
| ElasticExternal: discovery.UseExternalElastic(bootConfig) || cloudOpts.ElasticExternal, |
There was a problem hiding this comment.
UseExternalElastic is almost the same thing as ElasticExternal, I think - the main different possibly being the name of the config map that the EXTERNAL_ELASTIC var is set in?
Is there a way we can avoid having this multi-headed config knob?
| MultiTenant: multiTenant, | ||
| ElasticExternal: discovery.UseExternalElastic(bootConfig), | ||
| ElasticExternal: discovery.UseExternalElastic(bootConfig) || cloudOpts.ElasticExternal, | ||
| Cloud: cloudOpts.Cloud, |
There was a problem hiding this comment.
Fine for now, I think longer term this should be folded into the ProductVariant
There was a problem hiding this comment.
Maybe lets just do that now, I didn't actually think about that.
There was a problem hiding this comment.
I looked into folding this into ProductVariant and I don't think it's the right move:
Variantis user-facing config on theInstallationCR (enum-validated), but "this is a Calico Cloud management cluster" is an operator-deployment property that the cloud installer sets — not a product a user picks. They're different layers.- Cloud installs essentially all the enterprise components, so a
CalicoCloudvariant would have to be treated as enterprise everywhere (all ~76IsEnterprise()call sites), and we'd still need a separate cloud flag for the cloud-specific modifications on top. So it doesn't actually remove the knob — it adds a variant on top of it. - Exact
== CalicoEnterprisecomparisons (e.g. the CRD watches incore_controller.go) would silently not match a new variant value, which is an easy source of subtle breakage.
So I'd keep it as the deployment-level CALICO_CLOUD flag: cloud runs as CalicoEnterprise + Cloud=true, which leaves the EE behavior intact and keeps the cloud tweaks orthogonal. Happy to revisit if cloud ever diverges from EE enough to justify a real variant.
There was a problem hiding this comment.
@Brian-McM yeah, definitely not for this PR - that's what I meant to say by "longer term"
But we could have a IsEnterprise and an IsCloud (and IsEnterprise would return true for both).
Happy to revisit if cloud ever diverges from EE enough to justify a real variant.
I don't think it's about divergence, it's about keeping our configuration model clean.
There was a problem hiding this comment.
But we could have a IsEnterprise and an IsCloud (and IsEnterprise would return true for both).
Ya that would be much nicer, I shouldn't have let Claude respond on behalf of me there lol.
I might just end up doing the refactor here, so I'm going to keep this open as a maybe todo.
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Can't comment on them, but we shouldn't be checking in binary files (.pyc) below.
There was a problem hiding this comment.
Ya that was certain a mistake.
Also for future reference you can comment on files, there's a little text box next to the "Viewed" check box on the right.
| import requests | ||
| import ruamel.yaml | ||
|
|
||
| class ManagementClusterUpdater: |
There was a problem hiding this comment.
Not sure this is the right home for this? Seems like this belongs in the cloud tooling repos somewhere instead of the operator?
|
|
||
| // watch spawns a goroutine which should exit if a configmap's data is changed. | ||
| // it is stubbed for testing. | ||
| var watch = func(cs kubernetes.Interface, cmData map[string]string) error { |
There was a problem hiding this comment.
No need to change for this PR, but longer term seems like this should be handled by a controller rather than goroutine spawned as a side-effect of Load() (if we need this configmap long term at all)
| KeyValidatorConfig: keyValidatorConfig, | ||
| KubernetesVersion: r.opts.KubernetesVersion, | ||
| ClusterDomain: r.opts.ClusterDomain, | ||
| Cloud: r.opts.Cloud, |
There was a problem hiding this comment.
I've already started doing this in other places - I'll handle it in my own refactoring but worth calling out that I'm moving to just having the ControllerOptions as a field on the controller instead of copying each var individually.
|
|
||
| // Calico Cloud: update the trusted bundle with system root certificates on management clusters, | ||
| // as they may be needed by compliance-server when the OIDC provider is external. | ||
| // TODO: we should do this closer to instantiation of the trustedBundle but can't because that code |
There was a problem hiding this comment.
Is this TODO fixable now that it's merged upstream?
| reqLogger.Error(nil, "Kibana URL must be specified for this tenant") | ||
| d.status.SetDegraded(operatorv1.ResourceValidationError, "Kibana URL must be specified for this tenant", nil, reqLogger) | ||
| return reconcile.Result{}, nil | ||
| if d.multiTenant || !d.cloud { |
There was a problem hiding this comment.
Hm. This feels a little bit wrong to me - might just come out in the wash if EE doesn't use external ES at all, but an EE cluster that is not multi-tenant will unconditionally error here?
If EE doesn't support this branch, then why do we need the !d.cloud clause?
Might just need an explainer comment.
| if r.cloud { | ||
| // Read the cloud-kibana-config ConfigMap and parse it into a global map in the render package. The render code will read this map | ||
| // and use it to override the default Kibana configuration. | ||
| // TODO: We should instead be passing this via arguments to the render code. |
There was a problem hiding this comment.
Can we fix this TODO now that we're merged upstream?
| } | ||
|
|
||
| if opts.Cloud { | ||
| // Calico Cloud addition. |
There was a problem hiding this comment.
nit: comment is redundant with the code here.
| reqLogger.Error(nil, "Elasticsearch URL must be specified for this tenant") | ||
| r.status.SetDegraded(operatorv1.ResourceValidationError, "Elasticsearch URL must be specified for this tenant", nil, reqLogger) | ||
| return reconcile.Result{}, nil | ||
| if r.multiTenant || !r.cloud { |
There was a problem hiding this comment.
Hm. common code that maybe could use a helper func to avoid desychronizing?
| var mcr render.ManagerCloudResources | ||
| if r.opts.Cloud { | ||
| var reconcileResult *reconcile.Result | ||
| bundleMaker, mcr, tenant, reconcileResult, err = r.handleCloudReconcile( |
There was a problem hiding this comment.
This is probably fine, but I think we probably only structure the code this way because it used to be a fork and we didn't want to interleave merge conflicts - we should probably unpack this as a follow on to clean it up instead of overwriting / recalculating what was done above.
There was a problem hiding this comment.
Oh I fully intend on doing that right now, this is the initial pass.
| // and instead will configure the cluster to use an external Elasticsearch. | ||
| ElasticExternal bool | ||
|
|
||
| // Cloud indicates the operator is running as a Calico Cloud install. When set, controllers |
There was a problem hiding this comment.
"Calico Cloud Install" -> I assume this actually means "Calico Cloud Management Cluster" ?
| "sigs.k8s.io/controller-runtime/pkg/client" | ||
| ) | ||
|
|
||
| // cloudPatchTier removes the allow-tigera tier's app.kubernetes.io/instance label to fix Calico |
There was a problem hiding this comment.
allow-tigera doesn't exist any more.
But also, can't we do this inline now instead of a separate patch step? i.e., an if statement in the render code?
There was a problem hiding this comment.
I think a lot of the code is structured the way it is due to the history of being a fork and wanting to minimize conflicts - but now that it's upstreamed that structuring adds complexity rather than removing it.
|
|
||
| // GetCloudConfig retrieves the config map containing the configuration values needed to set up communications with | ||
| // external Elasticsearch and Kibana, such as the externalESDomain and externalKibanaDomain. | ||
| func GetCloudConfig(ctx context.Context, cli client.Client) (*cloudconfig.CloudConfig, error) { |
There was a problem hiding this comment.
Does it make sense for this to be separate from the other utility functions in pkg/cloud/ ?
| } | ||
| } | ||
| Expect(false).To(BeTrue(), fmt.Sprintf("Missing expected volume mount %s", name)) | ||
| ExpectWithOffset(1, false).To(BeTrue(), fmt.Sprintf("Missing expected volume mount %s", name)) |
There was a problem hiding this comment.
Glad we finally fixed this one upstream 👍
|
|
||
| // Calico Cloud additions. TenantId is only set by the cloud-gated controller path; when empty | ||
| // (regular Calico/Calico Enterprise) no cloud env is emitted. | ||
| TenantId string |
There was a problem hiding this comment.
| TenantId string | |
| TenantID string |
| return evs | ||
| } | ||
|
|
||
| func (e *esGateway) modifyDeploymentForCloud(d *appsv1.Deployment) { |
There was a problem hiding this comment.
Does anywhere in cloud still use es-gateway?
There was a problem hiding this comment.
We still have single tenant clusters which I think still use this.
| } | ||
|
|
||
| if e.cfg.Cloud.EnableMTLS { | ||
| // todo: delete these from the envVars |
There was a problem hiding this comment.
When can we address this TODO? Doesn't really say.
| } | ||
| } | ||
|
|
||
| e.modifyDeploymentForCloud(&d) |
There was a problem hiding this comment.
I'd say this pattern is wrong, but tbh it does look a fair bit like what I'm doing in #4871
So maybe it's actually best to keep it like this and eventually I rebase my PR and refactor lots of these to use the new extension mechanism instead.
|
|
||
| // CloudKibanaConfigOverrides holds Calico Cloud Kibana config overrides. It is populated only by the | ||
| // cloud-gated controller path; for regular Calico/Calico Enterprise it stays empty and is a no-op. | ||
| // TODO: This shouldn't be done with a global variable set by the controller; thread it through Configuration instead. |
| # image. Enterprise/OSS builds (VARIANT unset) are completely unaffected. PUSH_MANIFEST_IMAGE_PREFIXES | ||
| # and PUSH_NONMANIFEST_IMAGE_PREFIXES above use recursive `=`, so they pick up these overrides. | ||
| ifeq ($(VARIANT),cloud) | ||
| REPO:=tigera/operator-cloud |
There was a problem hiding this comment.
Think this might need to be tweaked?
| sh -c '$(GIT_CONFIG_SSH) \ | ||
| go build -buildvcs=false -o hack/bin/release ./hack/release' | ||
|
|
||
| # Calico Cloud release tooling. The cloud release tool is the same tool compiled with `-tags cloud`, |
There was a problem hiding this comment.
Do we need the build tags now? Would rather avoid those if possible.
CC @tigera/release-team / @radTuti to consider how we integrate the release tooling here.
| # image to GCR (gcr.io/tigera-tesla/operator-cloud), amd64 only, instead of the enterprise quay | ||
| # image. Enterprise/OSS builds (VARIANT unset) are completely unaffected. PUSH_MANIFEST_IMAGE_PREFIXES | ||
| # and PUSH_NONMANIFEST_IMAGE_PREFIXES above use recursive `=`, so they pick up these overrides. | ||
| ifeq ($(VARIANT),cloud) |
There was a problem hiding this comment.
if we are setting VARIANT=cloud, do we need all the separate ?-cloud targets? (maybe gen-versions-cloud)
There was a problem hiding this comment.
Collapsed the separate -cloud targets into VARIANT=cloud on the existing release / release-publish / release-tag targets, so there's no release-cloud twin anymore. Kept gen-versions-cloud (as you suggested) since it generates a distinct file (pkg/components/cloud.go) rather than a variant of an existing target's output.
| # Calico Cloud release tooling. The cloud release tool is the same tool compiled with `-tags cloud`, | ||
| # which activates hack/release/cloud.go (GCR/tesla image defaults, cloud-vX.Y.Z version format, | ||
| # hashrelease support). The regular `release`/`release-publish` targets above are unaffected. | ||
| hack/bin/release-cloud: $(shell find ./hack/release -type f) |
There was a problem hiding this comment.
I don't we should need a new binary build to be able to get a cloud release. The release binary should handle be able to handle both.
There was a problem hiding this comment.
My preference would be that we use a new binary so that you can't accidentally modify an env variable and have the operator uninstall cloud components.
This is managing part of the cloud infrastructure, so even if it costs a little more to build and release I would much prefer the guarantee that the cloud flags are always set in the binary when running in cloud.
There was a problem hiding this comment.
Landed here: one release binary, no build tags. cloud.go's cloud behavior now activates at runtime when VARIANT=cloud, so make release VARIANT=cloud / make release-tag VARIANT=cloud … drive the cloud release from the same hack/bin/release — your "the release binary should handle both" point. Removed the -tags cloud build and the release-cloud / release-publish-cloud / release-tag-cloud targets.
Brian's "flags must be guaranteed" concern is handled a layer down: VARIANT=cloud bakes cloud mode into the cloud operator image at build time (-ldflags -X …/pkg/cloud.buildVariant=cloud → pkg/cloud.IsCloudBuild()), so a mutated Deployment env can't flip cloud off. Two images (enterprise + cloud), one release binary.
d648b28 to
4daae70
Compare
Bring the Calico Cloud CI from the fork, adapted to tigera/operator and made to coexist with the enterprise CI. Semaphore (additive — enterprise pipelines untouched): - push_images_cloud.yml / release_cloud.yml: new pipelines that push the operator-cloud image to GCR via VARIANT=cloud (amd64 only); release triggers on cloud-v* tags. - semaphore.yml: two additive promotions (Push Cloud Images, Release Cloud); the enterprise promotions and multi-arch build are unchanged. - Makefile: release-tag-cloud target (uses the -tags cloud release tool). ArgoCI (.argoci/, new to this repo): build-hashrelease + update-cluster-with- hashrelease workflows and hack/hashrelease/*.py cluster updaters. Source-repo references retargeted operator-cloud -> tigera/operator (clone URL/dir, labels, branch param, build command -> make release-cloud release-publish-cloud); the published image name (operator-cloud), GCR project, secrets, Slack and cluster infra are kept as-is. Fork-maintenance machinery (fork-sync crons/templates, update_fork.sh, approve_check.yml) is intentionally dropped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4daae70 to
625f1cb
Compare
Description
Type of change: new feature (Calico Cloud support / un-fork of
tigera/operator-cloud).This brings the differences carried by the long-lived
tigera/operator-cloudfork back intotigera/operator, so the fork can be retired. All cloud behavior is gated behind theCALICO_CLOUDenvironment variable (
cloud.EnableCloudEnvVar), set on the operator Deployment by the cloudinstaller. When the gate is off (every Calico / Calico Enterprise install), behavior is unchanged —
the cloud code paths are inert.
Why merge this: it removes the maintenance burden of a separate fork that had to be continually
re-synced with this repo, and makes Calico Cloud a first-class, gated mode of the upstream operator.
How it works:
pkg/cloudparsesCALICO_CLOUDat startup and returnsCloud=false(no-op) unless it is set,threading
options.ControllerOptions.Cloudto controllers.render.ManagerCloudResources)only when
opts.Cloudis true; render decorators are no-ops otherwise.config/cloud_versions.ymlintopkg/components/cloud.govia anew standalone
make gen-versions-cloudtarget (not part of the defaultgen-versionsaggregate, soenterprise generated output is byte-identical).
Components affected (all gated): manager, logstorage (linseed / esgateway / kibana / elastic /
external-elastic / dashboards / kube-controllers + ES ILM), api-server RBAC, compliance, fluentd, tiers,
intrusion-detection, policy-recommendation, monitor, packet-capture. Intrusion-detection additionally
cleans up orphaned Image Assurance resources.
Testing:
make build/go build ./pkg/... ./cmd/...succeed.esgateway / linseed render and elastic / compliance / manager controllers), each with cloud-on and
cloud-off (negative) cases confirming enterprise output is unchanged.
master; cloud version generation is idempotent.A full migration map and rationale is in
docs/cloud-unfork-migration-plan.md.Prerequisite context:
tigera/operator-cloud#1059(removal of dead Image Assurance / runtime security)landed in the fork first.
Release Note
For PR author
make gen-files(noapi/v1/ CRD changes in this PR)make gen-versions(make gen-versions-cloud→pkg/components/cloud.go)For PR reviewers
A note for code reviewers - all pull requests must have the following:
kind/bugif this is a bugfix.kind/enhancementif this is a a new feature.enterpriseif this PR applies to Calico Enterprise only.🤖 Generated with Claude Code