
Add initial threat model for substrate#303

Open
Michael Taufen (mtaufen) wants to merge 2 commits into agent-substrate:main from mtaufen:threat-model

Conversation

@mtaufen

Collaborator

This adds an initial threat model for Substrate. The threat model was originally authored in this Google Doc and shared on the mailing list.

Note that access to the doc is gated by mailing list membership: you must join the list to view it, and individual access requests can't be granted due to a policy restriction. But please comment on this PR at this point if you have feedback :).

@google-cla

google-cla Bot commented Jun 25, 2026


Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Co-authored-by: Vikas Kumar <skvikas@google.com>
Co-authored-by: Oleg Mitrofanov <gooleg@google.com>
@mesutoezdil

Mesut Oezdil (mesutoezdil) commented Jun 26, 2026

Contributor

Thanks for landing this in the repo. For some reason, I couldn't access the Google link that was shared.

Reading it, some of these threats seem to me to come from the architecture itself. I cannot share my Excalidraw link, but a screenshot is attached.

If we move a few things, the list gets shorter.

I put an alternative in the attached image. The reasoning behind it:

  1. ate-api-server keeps the API, the DB, and the identity broker in one box. This is the worst point for me: one compromise and all three go at the same time. Identity should not live in the same component as the state DB. Better to have a separate SPIFFE/SPIRE issuer with short-lived SVIDs, and no long-lived secret kept in the DB.

  2. If Router is on the A2A path, a compromised Router can read and rewrite every agent-to-agent message, which is a large MITM surface. I would keep Router for the control plane only (discovery, policy) and do the real A2A over mesh mTLS between the SVIDs, with the token bound to the channel, so that even the mesh cannot impersonate an agent.

  3. In the worker pod, ateom and the sandbox look like the same trust domain. If the sandbox runs agent code and that code escapes, it is sitting next to the keys. The component holding the agent private key should not be in the same namespace as the sandbox. I would run the sandbox in a microVM (Kata or Firecracker), terminate identity on the node side (Atelet), and use a short TTL on the SVID.

  4. Pod Snapshots Object Storage is the part I worry about most. A memory snapshot carries secrets, live tokens, in-flight messages, and data from other agents. Two problems follow from this. First, a token taken from a snapshot, if it is not channel-bound, can be replayed after restore on another connection. Second, if you can snapshot or fork, you get two live copies with the same identity, so single-writer is broken and the same side effect can run twice. Checkpoints should therefore be encrypted and credential-scrubbed, with creds re-minted on restore instead of restored from the dump, plus generation/epoch fencing and idempotency keys so that a fork cannot act twice. The Actor Directory could be the place that enforces single activation. It is also a fair question whether a full memory snapshot is needed at all, or whether explicit session state is enough for warm start, and safer.

  5. I could not find a fail-closed point for egress. (I have shared some insight here as well: https://docs.google.com/document/d/1KmpIFu2gnqy9gp95wASgIo_vkJ_dA1DZckV8upET6bs/edit?tab=t.0#heading=h.3wbvdrfz1pyw) Tool calls and A2A go to external systems. If the policy plane is down, what is the behaviour? If it fails open, a partition becomes free egress for a compromised agent. Egress should be per node, default deny, SNI/ECH aware (no SNI-spoofing bypass, and ECH should not blind the enforcement), and fail closed.
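
The fail-closed, default-deny behaviour asked for in point 5 can be sketched as a single decision function. This is only an illustration under stated assumptions, not Substrate's implementation: `EgressPolicy`, `decide_egress`, and the allowlist shape are hypothetical names, and SNI/ECH handling is out of scope.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class EgressPolicy:
    # Hypothetical policy shape: an explicit allowlist of destination hosts.
    allowed_hosts: FrozenSet[str]


def decide_egress(policy: Optional[EgressPolicy], dest_host: str) -> Verdict:
    """Return a verdict for one outbound connection.

    `policy` is None when the policy plane is unreachable or nothing has
    been loaded yet (an assumption about how the caller signals that).
    """
    if policy is None:
        # Fail closed: no policy means no egress, even during a partition.
        return Verdict.DENY
    if dest_host.lower() in policy.allowed_hosts:
        return Verdict.ALLOW
    # Default deny: anything not explicitly allowed is rejected.
    return Verdict.DENY
```

The point of the shape is that "policy unavailable" and "destination not listed" both end in the same deny branch, so a partition never widens what a compromised agent can reach.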

For me the hardest one is 4 (the checkpoint, fork, and identity part). I do not have a fully clean answer there yet.
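
One possible shape for the generation fencing and idempotency keys from point 4 is sketched below. It assumes a directory-like component that hands out a new generation on every activation (run, restore, or fork). `ActivationDirectory` and its methods are hypothetical, and the in-memory dict and set stand in for whatever durable store would really be used.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


class StaleGeneration(Exception):
    """Raised when a fenced-off (older) copy of an actor tries to act."""


@dataclass
class ActivationDirectory:
    # Hypothetical stand-in for a component enforcing single activation.
    _generation: Dict[str, int] = field(default_factory=dict)
    _applied: Set[Tuple[str, str]] = field(default_factory=set)

    def activate(self, actor_id: str) -> int:
        """Issue a fresh generation; every earlier copy is now stale."""
        gen = self._generation.get(actor_id, 0) + 1
        self._generation[actor_id] = gen
        return gen

    def apply(self, actor_id: str, generation: int, idempotency_key: str) -> bool:
        """Return True if the side effect should run, False if it is a duplicate.

        Raises StaleGeneration when the caller does not hold the current
        generation, which is what fences off a forked or restored twin.
        """
        if generation != self._generation.get(actor_id):
            raise StaleGeneration(f"{actor_id}: generation {generation} is not current")
        key = (actor_id, idempotency_key)
        if key in self._applied:
            return False
        self._applied.add(key)
        return True
```

In this sketch a restore or fork calls `activate` and gets a higher generation, so the older copy fails on its next `apply`, and a retried side effect with the same key is reported as a duplicate instead of running twice.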

Screenshot 2026-06-26 at 23 36 55

@thockin Tim Hockin (thockin) left a comment

Collaborator

It took me a while to get the swing of reading this. I left my early comments, but other than internal/external I don't think the rest matter.

Comment thread docs/threat-model.md
* **ateom:** Per-worker Pod sidecar, running inside the worker Pod. Ateom sets up "interior" sandboxes in the worker Pod and manages sandbox lifecycle, including image pulls. It currently uses gvisor but Substrate will support multiple microvm solutions.
* **Worker:** Preprovisioned Pods that actors get scheduled to.
* **Actor:** The core compute primitive, gets scheduled to/from worker via Run for cold start and Resume for snapshot resume.
* **Actor IP:** Actor networking is based on Pod networking. Each actor gets the IP of the worker it's currently scheduled to. The ateom has the opportunity to set up additional rules when it sets up interior sandboxes.

Collaborator

I'm not sure it matters here, but I am not sure that an actor having "real" pod networking is going to mean anything. All traffic will be captured, so the touchpoint is near zero.

Comment thread docs/threat-model.md

| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
| :---- | :---- | :---- | :---- | :---- | :---- |
| | Critical | External attacker can access actors over the internet | Access to actors over the external internet is blocked. | Recommend use of infrastructure firewall to limit external ingress/egress in documentation (cloud specific). Use Kubernetes NetworkPolicy to limit external ingress/egress by default. Use additional network policy features depending on CNI. | |

Collaborator

I think this can be split up - I see the "internal" section below but no equivalent of this line.

External: Actors, workers, and the atenet routers are not directly exposed to the internet in most clusters. Doing so would require explicit configuration.

Internal: Many kube clusters are managed as first-class citizens of their local network, so peers on the internal network COULD reach them at an IP level. This is no different from any other workload running in Kubernetes.

Comment thread docs/threat-model.md
| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
| :---- | :---- | :---- | :---- | :---- | :---- |
| | Critical | External attacker can access actors over the internet | Access to actors over the external internet is blocked. | Recommend use of infrastructure firewall to limit external ingress/egress in documentation (cloud specific). Use Kubernetes NetworkPolicy to limit external ingress/egress by default. Use additional network policy features depending on CNI. | |
| | Critical | External attacker can access nodes over the internet | Access to nodes over the external internet is blocked. | Recommend use of infrastructure firewall to limit external ingress/egress in documentation (cloud specific). Use Kubernetes NetworkPolicy to limit external ingress/egress by default. Use additional network policy features depending on CNI. | |

Collaborator

Saying that access is "blocked" in all these cases feels wrong. It feels like it is implying that we did something when in actuality it's the "normal" disposition of Kubernetes. This sort of document is outside my core expertise -- is that distinction important?

Comment thread docs/threat-model.md
| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
| :---- | :---- | :---- | :---- | :---- | :---- |
| | Critical | Access to the internal network allows arbitrary actions to be performed on ate-apiserver, atelet, substrate backend database, etc. | All system components must have basic mutual authentication and authorization, and communicate over TLS. All clients (including end users and actors) must be authenticated and authorized. Unauthenticated traffic must be rejected. | mTLS or other secure channel (e.g. UDS) between networked system components (ate-apiserver, atelet, ateom, etc.). Each atelet has a unique identity cryptographically tied to the node identity. ate-router should check client permissions before resuming actors or forwarding traffic to actors. The only component authorized to connect directly to the backend database should be ate-apiserver. | |
| | High | Privilege escalation via access to sensitive labels. | If Substrate offers its own resource labeling mechanism, it must also offer a way to authorize label updates on a per-label basis. | Substrate authorization system requires explicit authorization to update metadata, separate from updating resource body. Substrate authorization system supports per-label authorization rules. | Plenty of attacks in K8s were possible because labels had semantic meaning, but the permission model could implicitly grant access to modify labels, even if it was inappropriate. For example, /status subresource allows label updates. Substrate shouldn't repeat this mistake. |

Collaborator

This is sort of presuming what labels are for, but we don't really even have labels yet. I think the point about being more careful is important, though.
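
To make the per-label authorization idea in the quoted row concrete, here is a rough sketch. Since Substrate does not have labels yet, everything in it is an assumption: labels are modelled as plain string maps, grants as a mapping from principal to the label keys it may change, and `authorize_label_update` is a hypothetical name.

```python
from typing import Mapping, Set


def changed_label_keys(old: Mapping[str, str], new: Mapping[str, str]) -> Set[str]:
    """Keys that were added, removed, or had their value changed."""
    keys = set(old) | set(new)
    return {k for k in keys if old.get(k) != new.get(k)}


def authorize_label_update(
    grants: Mapping[str, Set[str]],
    principal: str,
    old: Mapping[str, str],
    new: Mapping[str, str],
) -> Set[str]:
    """Return the changed label keys the principal may NOT modify.

    An empty result means the update is allowed. Principals with no
    grant entry are allowed to change nothing (default deny).
    """
    allowed = grants.get(principal, set())
    return changed_label_keys(old, new) - allowed
```

The check looks only at the keys that actually changed, so permission to update a resource body would not implicitly carry permission to touch a label with semantic meaning.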

Comment thread docs/threat-model.md
| :---- | :---- | :---- | :---- | :---- | :---- |
| | Critical | Access to the internal network allows arbitrary actions to be performed on ate-apiserver, atelet, substrate backend database, etc. | All system components must have basic mutual authentication and authorization, and communicate over TLS. All clients (including end users and actors) must be authenticated and authorized. Unauthenticated traffic must be rejected. | mTLS or other secure channel (e.g. UDS) between networked system components (ate-apiserver, atelet, ateom, etc.). Each atelet has a unique identity cryptographically tied to the node identity. ate-router should check client permissions before resuming actors or forwarding traffic to actors. The only component authorized to connect directly to the backend database should be ate-apiserver. | |
| | High | Privilege escalation via access to sensitive labels. | If Substrate offers its own resource labeling mechanism, it must also offer a way to authorize label updates on a per-label basis. | Substrate authorization system requires explicit authorization to update metadata, separate from updating resource body. Substrate authorization system supports per-label authorization rules. | Plenty of attacks in K8s were possible because labels had semantic meaning, but the permission model could implicitly grant access to modify labels, even if it was inappropriate. For example, /status subresource allows label updates. Substrate shouldn't repeat this mistake. |
| | High | Attacker gains control of Substrate API server, router, or other ingress/egress proxy. | Isolate the control plane from the data plane, and isolate data plane ingress from sandboxes. | Don't co-locate ate-apiserver or other control-plane components on the same machines as the untrusted sandboxes. Consider running any gateway/router that enables direct interaction with sandboxes on a separate VM from the sandboxes, or using a zero-trust architecture where traffic is encrypted and authenticated end-to-end. | |

Collaborator

Don't co-locate ate-apiserver or other control-plane components on the same machines as the untrusted sandboxes.

I think the point here should be clarified. While we trust the sandboxing tech, there's some ADDITIONAL security to be had by segregating. For large users, or users who run truly third-party code in sandboxes, the exposure risk of not segregating likely outweighs any potential benefits.

Comment thread docs/threat-model.md
| | High | Privilege escalation via access to sensitive labels. | If Substrate offers its own resource labeling mechanism, it must also offer a way to authorize label updates on a per-label basis. | Substrate authorization system requires explicit authorization to update metadata, separate from updating resource body. Substrate authorization system supports per-label authorization rules. | Plenty of attacks in K8s were possible because labels had semantic meaning, but the permission model could implicitly grant access to modify labels, even if it was inappropriate. For example, /status subresource allows label updates. Substrate shouldn't repeat this mistake. |
| | High | Attacker gains control of Substrate API server, router, or other ingress/egress proxy. | Isolate the control plane from the data plane, and isolate data plane ingress from sandboxes. | Don't co-locate ate-apiserver or other control-plane components on the same machines as the untrusted sandboxes. Consider running any gateway/router that enables direct interaction with sandboxes on a separate VM from the sandboxes, or using a zero-trust architecture where traffic is encrypted and authenticated end-to-end. | |
| | High | Attacker who can create ActorTemplates specifies malicious runtime. | Ensure available runtime can only be configured by administrators. | Consider a mechanism like RuntimeClass to decouple configuration of available runtimes from consumption of available runtimes. | |
| | High | Attacker who can create ActorTemplates can read or write any storage buckets atelet has access to. | Ensure that bucket access is least-privilege. | Use credentials derived from actor identity to read snapshots. Configure permissions to prevent atelet or nodes from having access to sensitive buckets. | For example: Attacker creates an ActorTemplate with the runsc URL or golden snapshot URL pointing to an arbitrary bucket in the same project/resource scope as the cluster. If atelet has project-wide access to buckets, this could cause the state to be pulled into the worker pod or malicious actor. Similarly, an attacker could set the snapshots URL to point to an internal infrastructure bucket, causing data to be written to that bucket. |

Collaborator

This suggests to me that having a snapshot URL is overly powerful. Why do we need more than an identifier which atelet can combine with other configuration to produce that URL?
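
A minimal sketch of the identifier-based approach suggested here, assuming the bucket and prefix come from administrator-controlled configuration and only the identifier comes from the ActorTemplate. The function name `snapshot_url`, the identifier pattern, and the `gs://` scheme are all hypothetical choices made for illustration.

```python
import re

# Hypothetical identifier format: lowercase letters, digits, and hyphens,
# starting with a letter or digit, at most 63 characters.
_SNAPSHOT_ID = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")


def snapshot_url(bucket: str, prefix: str, snapshot_id: str) -> str:
    """Build the storage URL from trusted config plus an untrusted identifier.

    `bucket` and `prefix` are assumed to come from atelet configuration,
    never from the ActorTemplate, so a template author cannot point
    reads or writes at an arbitrary bucket.
    """
    if not _SNAPSHOT_ID.fullmatch(snapshot_id):
        raise ValueError(f"invalid snapshot identifier: {snapshot_id!r}")
    return f"gs://{bucket}/{prefix.strip('/')}/{snapshot_id}"
```

Because the identifier cannot contain `/`, `.`, or `:`, it cannot name another bucket or escape the configured prefix.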

Comment thread docs/threat-model.md
| :---- | :---- | :---- | :---- | :---- | :---- |
| | Critical | Access to the internal network allows arbitrary actions to be performed on ate-apiserver, atelet, substrate backend database, etc. | All system components must have basic mutual authentication and authorization, and communicate over TLS. All clients (including end users and actors) must be authenticated and authorized. Unauthenticated traffic must be rejected. | mTLS or other secure channel (e.g. UDS) between networked system components (ate-apiserver, atelet, ateom, etc.). Each atelet has a unique identity cryptographically tied to the node identity. ate-router should check client permissions before resuming actors or forwarding traffic to actors. The only component authorized to connect directly to the backend database should be ate-apiserver. | |
| | High | Privilege escalation via access to sensitive labels. | If Substrate offers its own resource labeling mechanism, it must also offer a way to authorize label updates on a per-label basis. | Substrate authorization system requires explicit authorization to update metadata, separate from updating resource body. Substrate authorization system supports per-label authorization rules. | Plenty of attacks in K8s were possible because labels had semantic meaning, but the permission model could implicitly grant access to modify labels, even if it was inappropriate. For example, /status subresource allows label updates. Substrate shouldn't repeat this mistake. |
| | High | Attacker gains control of Substrate API server, router, or other ingress/egress proxy. | Isolate the control plane from the data plane, and isolate data plane ingress from sandboxes. | Don't co-locate ate-apiserver or other control-plane components on the same machines as the untrusted sandboxes. Consider running any gateway/router that enables direct interaction with sandboxes on a separate VM from the sandboxes, or using a zero-trust architecture where traffic is encrypted and authenticated end-to-end. | |

Collaborator

This overlaps a bit with "Access to the internal network allows arbitrary actions to be performed on ate-apiserver"?

Comment thread docs/threat-model.md
| | High | Improper handling of Secrets | Ensure there is an official, secure, recommended way to pass secret data, like API access tokens, to actors. | Support env and filesystem plumbing for Kubernetes Secrets, to provide an official path that avoids secret material being plumbed via nonspecific fields that are difficult to audit. Ensure secrets are encrypted in transit and ideally stored in memory. If exposed via the filesystem, do so via in-memory tmpfs. | If we don't support this, users will inevitably put secrets in plaintext. |
| | Medium | Complexity configuring permissions for frameworks on top of Substrate may lead to unintentional privilege escalation. | It must be clear to users what the downstream effects of auth config in substrate are. | If it's not intuitive, it must be documented in a user guide. | AI framework has to set up permissions to access ATE, and to access K8s, and potentially for actors (based on ATE identity and K8s identity) to access the framework. We need to make this easy. Think about past K8s issues like escalate/bind risk. Substrate resource model is spread across ate-apiserver and K8s, increasing complexity and chance for error. |
| | Medium | Flat namespace of actors encourages broad permission grants or complex graph-oriented policy. | Support a grouping mechanism that can be used in policy controls. | Add namespaces to Substrate, similar to Kubernetes. | |
| | Medium | DNS misconfiguration | Access to DNS configuration should be limited to authoritative controllers. Routing should use stable configurations and query the API for the current IP before routing each request. | Don't co-locate controllers with access to sensitive system state on the same nodes as actors. Limit permissions to update DNS configuration. Actively query Substrate API to ensure IPs are as up-to-date as possible. Potentially use mTLS based on actor DNS name between ate-router and actors. | As noted above, a flood of requests could create high read load on ate-apiserver. Caching could be considered, but cache invalidation during actor rescheduling would be important to avoid misrouting traffic. Establishing a backend mTLS tunnel between ate-router and each actor based on a serving cert signed for the actor's DNS name could be another approach to avoiding misrouting. |

Collaborator

Do we consider DNS info leaks a threat? E.g., I can brute-force discover the names of other namespaces and actors by probing DNS.


3 participants