Design proposal: self-hosted in-cluster registry for air-gapped Cozystack #21
George Gaál (gecube) wants to merge 1 commit into
…ozystack
Migrated from discussion cozystack/cozystack#3029 to the design-proposal process for review.
Signed-off-by: Gaál György <gb12335@gmail.com>
📝 Walkthrough
A new design proposal document is added.
Changes: Air-gap in-cluster registry design proposal
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces a design proposal for establishing a self-hosted, in-cluster registry to support air-gapped Cozystack installations. The proposed workflow bootstraps from a temporary local registry and transitions to a persistent in-cluster registry using containerd registry mirrors. The review feedback highlights two key areas for improvement: first, removing the public registry fallback from the mirror configuration to prevent connection timeouts in strictly air-gapped environments, and second, clarifying how the internal registry domain will be resolved at the host containerd level since Talos nodes typically bypass in-cluster DNS.
    machine:
      registries:
        mirrors:
          ghcr.io:
            endpoints:
              - https://registry.cozy-system # internal, preferred
              - https://ghcr.io # fallback
In a strictly air-gapped environment, including the public registry (https://ghcr.io) as a fallback endpoint can cause containerd to experience long connection timeouts when the internal registry is temporarily unavailable, rather than failing fast. Consider omitting the public fallback or making it conditionally configured.
Suggested change:
      machine:
        registries:
          mirrors:
            ghcr.io:
              endpoints:
                - https://registry.cozy-system # internal, preferred
    -           - https://ghcr.io # fallback
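A check along these lines could catch an external fallback before the config reaches a node. This is a minimal sketch, not part of the proposal: it assumes the machine config has already been parsed into a dictionary, and the `INTERNAL_HOSTS` allowlist and the `external_endpoints` helper are hypothetical names introduced here for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts considered internal to the air-gapped cluster.
INTERNAL_HOSTS = {"registry.cozy-system"}


def external_endpoints(machine_config, internal_hosts=INTERNAL_HOSTS):
    """Return (mirror, endpoint) pairs whose host is not in internal_hosts."""
    mirrors = (
        machine_config.get("machine", {})
        .get("registries", {})
        .get("mirrors", {})
    )
    offenders = []
    for mirror, spec in mirrors.items():
        for endpoint in spec.get("endpoints", []):
            # hostname is None for malformed URLs, which also counts as an offender.
            if urlsplit(endpoint).hostname not in internal_hosts:
                offenders.append((mirror, endpoint))
    return offenders


# The mirror snippet from the proposal, expressed as an already-parsed dict.
config = {
    "machine": {
        "registries": {
            "mirrors": {
                "ghcr.io": {
                    "endpoints": [
                        "https://registry.cozy-system",
                        "https://ghcr.io",
                    ]
                }
            }
        }
    }
}

for mirror, endpoint in external_endpoints(config):
    print(f"mirror {mirror}: external endpoint {endpoint}")
```

Run against the original snippet, the helper reports the `https://ghcr.io` entry; with the fallback removed it returns an empty list.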
- **Bump before pre-load** → immediate `ImagePullBackOff`; preflight check is the guard.
- **Total registry outage** → blocks new pulls only; already-running pods keep their cached images. Mitigated by multi-replica HA backing storage.
- **Incomplete bundle transfer** → caught by preflight digest verification before any version bump.
- **DNS for `registry.cozy-system` unresolvable at containerd level** → node cannot pull; the registry domain must resolve on every node, not just in-cluster.
Since host-level containerd on Talos nodes typically uses the node's configured upstream DNS rather than the in-cluster CoreDNS (to prevent resolution loops), resolving registry.cozy-system can be problematic. It would be highly beneficial to clarify how this resolution is achieved (e.g., using Talos host entries or a local DNS forwarder).
Suggested change:
- **DNS for `registry.cozy-system` unresolvable at containerd level** → node cannot pull; the registry domain must resolve on every node, not just in-cluster (e.g., mapped via Talos host entries or a local DNS forwarder).
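Whichever mechanism is chosen, a preflight step could confirm that the name resolves through the host's resolver rather than through CoreDNS. The following is a rough sketch under that assumption: `resolves` is a hypothetical helper, and it only gives a meaningful answer when run somewhere that shares the node's resolver configuration (how to obtain such an environment on Talos is outside this sketch).

```python
import socket


def resolves(hostname, port=443, resolver=socket.getaddrinfo):
    """Return True if hostname resolves via the given resolver.

    The resolver is injectable so the check can be exercised without a network.
    """
    try:
        return len(resolver(hostname, port)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    name = "registry.cozy-system"
    status = "resolves" if resolves(name) else "does NOT resolve"
    print(f"{name} {status} with the host resolver")
```

A failure here before any version bump would point at the DNS failure mode listed above rather than at the registry itself.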
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@design-proposals/airgap-in-cluster-registry/README.md`:
- Around line 23-28: Clarify the handling of non-OCI Talos assets in the air-gap
flow by updating the proposal around the “Artifact classes” and
upgrade/preflight sections. Specify whether Talos OS assets are stored and
served through the in-cluster registry as artifacts or via an
object-store-backed path with a manifest mapping, and describe how clients
verify them. Use the existing “Artifact classes” wording and the “OCI images and
Talos assets” push/upload steps to make the storage, serving, and verification
contract unambiguous.
- Around line 62-66: Tighten the bootstrap transport guidance in the airgap
registry README by explicitly limiting any plain HTTP or temporary insecure TLS
usage to a single-host or trusted admin-LAN PoC only, and require a clear
expiry/removal step before production cutover. Update the relevant registry push
and TLS verification sections referenced by the existing symbols so they state
these insecure settings must never become the default in runbooks and should be
replaced with secure transport before rollout.
- Around line 76-80: The air-gap mirror example still includes an external
fallback endpoint, which conflicts with the no-egress setup. Update the registry
mirror example in the README so the ghcr.io entry points only to internal
endpoints and remove the https://ghcr.io fallback from the example. Keep the
change localized to the mirror configuration snippet under the air-gap profile.
📒 Files selected for processing (1)
design-proposals/airgap-in-cluster-registry/README.md
Air-gap delivery spans two artifact classes across two cluster tiers:

**Artifact classes**
- OCI container images: pulled by containerd at runtime.
- Talos OS assets: kernel, initramfs, ISO, metal images (bare-metal) and the nocloud disk image (VMs).
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Clarify where non-OCI Talos assets live and how they are verified.
Lines 23-28 define Talos boot artifacts, but Lines 119-120 say to push “OCI images and Talos assets” to the in-cluster registry (distribution/zot). Please specify the exact storage/serving contract for non-OCI assets (registry-as-artifact vs. object store path + manifest mapping); otherwise the upgrade preflight is underspecified.
Also applies to: 115-121
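Independent of which storage contract the proposal settles on, client-side verification of non-OCI assets could reduce to checking each file against a digest list shipped with the bundle. The sketch below assumes a hypothetical manifest that maps relative file paths to SHA-256 hex digests; the manifest format and the `verify_assets` helper are illustrative and not taken from the proposal.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large disk images are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_assets(manifest, root):
    """Return the relative paths that are missing or whose digest does not match.

    manifest: dict mapping relative path -> expected SHA-256 hex digest
              (hypothetical format, assumed for this sketch).
    root:     directory the assets were unpacked or downloaded into.
    """
    failures = []
    for rel_path, expected in manifest.items():
        path = Path(root) / rel_path
        if not path.is_file() or sha256_of(path) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty result would let the preflight proceed; any entry in the returned list corresponds to the "incomplete bundle transfer" failure mode and should block the version bump.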
```
docker run -d -p 5000:5000 --name cozy-bootstrap-registry \
  -v /srv/cozy-registry:/var/lib/registry registry:2
cozystack images push --bundle cozystack-airgap-paas-full-v1.5.0.tar \
  --to http://ADMIN_IP:5000
```
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Tighten bootstrap transport security requirements.
Line 65 uses plain HTTP, and Line 138 allows temporary insecureSkipVerify. Even for PoC, define strict limits (single-host/admin LAN only, explicit expiry/removal before cutover) to prevent insecure defaults from leaking into production runbooks.
Also applies to: 138-138
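One way to keep insecure bootstrap settings from leaking into production runbooks is a gate that tolerates them only under an explicitly named PoC profile. This is a hedged sketch of such a gate: the profile names (`"poc"`, `"production"`) and both helper functions are hypothetical and only illustrate the policy described in this comment.

```python
from urllib.parse import urlsplit


def insecure_settings(endpoint, insecure_skip_verify=False):
    """List the insecure transport settings present for a registry endpoint."""
    problems = []
    if urlsplit(endpoint).scheme != "https":
        problems.append("endpoint is not https")
    if insecure_skip_verify:
        problems.append("insecureSkipVerify is enabled")
    return problems


def transport_allowed(profile, endpoint, insecure_skip_verify=False):
    """Allow insecure settings only under the hypothetical 'poc' profile."""
    problems = insecure_settings(endpoint, insecure_skip_verify)
    return not problems or profile == "poc"


if __name__ == "__main__":
    endpoint = "http://ADMIN_IP:5000"  # bootstrap endpoint from the proposal
    for profile in ("poc", "production"):
        verdict = "allowed" if transport_allowed(profile, endpoint) else "rejected"
        print(f"{profile}: {endpoint} {verdict}")
```

Running the gate as part of cutover would force the plain-HTTP bootstrap endpoint and any `insecureSkipVerify` setting to be removed before the production profile passes.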
```
ghcr.io:
  endpoints:
    - https://registry.cozy-system # internal, preferred
    - https://ghcr.io # fallback
```
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Remove internet fallback from the air-gap mirror example.
Line 79 (https://ghcr.io) conflicts with the no-egress objective and can cause slow/failing pulls if resolution/connect attempts happen before failover logic settles. Use an air-gap profile that only points to internal endpoints.
Proposed doc edit:
```diff
   ghcr.io:
     endpoints:
       - https://registry.cozy-system # internal, preferred
-      - https://ghcr.io # fallback
+      # no external fallback in air-gapped mode
```
Migrates discussion cozystack/cozystack#3029 into the design-proposal process.
Adds design-proposals/airgap-in-cluster-registry/README.md: evolve air-gap from "bring your own registry" to a bundled, self-hosted flow: offline bundle (OCI images + Talos assets) → throwaway admin registry for bootstrap → self-hosted in-cluster registry (distribution/zot) as the persistent source of truth, with redirection via Talos/containerd registry mirrors.
Source discussion: cozystack/cozystack#3029
Sibling proposal (migrated together): #22
DCO: commit is signed off.