feat: add --provider-cache-patch-on-apply flag to avoid full re-fetch on large zones by tanishcharaya · Pull Request #6447 · kubernetes-sigs/external-dns

tanishcharaya · 2026-05-19T15:32:09Z

What does this PR do?

Adds a --provider-cache-patch-on-apply boolean flag (default false) to CachedProvider.

When enabled, a successful ApplyChanges call patches the in-memory record cache in-place rather than invalidating it. This avoids a full provider re-fetch (e.g. ListResourceRecordSets) on the next reconcile loop after every write.

Why is this needed?

When --provider-cache-time is set, the previous behaviour unconditionally called Reset() before writing to the provider, invalidating the cache regardless of success or failure. On the next Records() call, external-dns would re-fetch the entire zone.

For large zones (e.g. 100k+ Route53 records with creates/deletes every minute) this means a full zone list on every sync cycle — negating the cache entirely and generating significant API cost and latency.

How does it work?

On successful ApplyChanges: removes deleted/updated-old entries and appends created/updated-new entries directly in the cache slice — no provider round-trip.
On failed ApplyChanges: still calls Reset() to guarantee consistency with the provider state.
Default is false — existing behaviour is fully preserved for users who do not set the flag.

Usage

--provider-cache-time=10m --provider-cache-patch-on-apply=true

k8s-ci-robot · 2026-05-19T15:32:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ivankatliarchuk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

linux-foundation-easycla · 2026-05-19T15:32:18Z

❌ The email address for the commit (a441085) is not linked to the GitHub account, preventing the EasyCLA check. Consult this Help Article and GitHub Help to resolve. (To view the commit's email address, add .patch at the end of this PR page's URL.) For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

k8s-ci-robot · 2026-05-19T15:32:19Z

Welcome @tanishcharaya!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-05-19T15:32:21Z

Hi @tanishcharaya. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

…rge zones

ivankatliarchuk · 2026-05-19T17:46:54Z

The PR's stated problem: A full zone re-fetch after every write negates the cache on large zones.

My view, only coredns + pihole + rfc2316 +awssd as provider, could currently handle 100k+ records reliably, rest will not. The current zone caching doesn't have this problems - missing a new zone for a few minutes is tolerable. Serving a stale record view to the planner can produce wrong creates, updates, or worse, skip correcting real drift.

I'll compare here two Caches, Fundamentally Different Things (zones vs dns records)

Zone cache (blueprint.ZoneCache[T]) - answers: "which zones do I manage?"

Caches zone IDs and metadata
Thread-safe with sync.RWMutex
Zones change rarely - a stale zone list is low-risk
Appropriate: caching metadata about structure

Record cache (CachedProvider) - answers: "what records currently exist?"

Caches the full flat []*endpoint.Endpoint list across all zones
No mutex - not thread-safe
Records are exactly what external-dns is actively managing
Risky: caching operational state that external-dns itself modifies

But the real problem is larger:

Throttling on initial fetch (and post-crash)

The motivation example is 100k+ Route53 records. Route53's ListResourceRecordSets has a default quota of 5 req/s, page size 300. That's ~333 API calls - around 67 seconds of sustained throttled API traffic just to build the initial cache. This PR does nothing for that window. Any scenario that forces a full re-fetch (pod restart, OOM kill, node eviction, rolling update) is just as expensive as before.

In Kubernetes, pod restarts are routine. The optimization only pays off in the narrow window of a continuously running, stable process - exactly what you can't guarantee in a Kubernetes workload.

The app is stateless by design

external-dns stores nothing durable. The cache lives only in the heap of the current process. The PR's patch-on-apply logic is best-effort warmth between reconcile loops, not a solution to the fetch cost. We are currently caching zones, it's not an expensive operation, but the zones not changing that often

The flag only helps if you already have the cache working

For the optimization to matter, you need:

--provider-cache-time set high enough that the cache doesn't expire between reconciles
No pod restart between writes
No failed ApplyChanges (which resets the cache anyway)

If you're on a 1-minute reconcile with a 10-minute cache and creates/deletes every loop, the math works in the happy path - but in any production Kubernetes environment the restart frequency likely closes that window regularly.

So why not to just set reconcile to 10 minutes?

PatchOnApply - is another sharp case. This solution introduces correctness risk. patch-on-apply makes the cache MORE persistent - even a write no longer triggers a re-fetch. So the cache can now drift indefinitely as long as writes
keep succeeding. The intent is efficiency, but the effect is that external-dns increasingly trusts its own memory over the actual provider state - it just accumulates edits in memory.

Interesting solution, this is a follow-up for #3402

tanishcharaya · 2026-05-19T17:57:42Z

@ivankatliarchuk What would you suggest to solve this problem ?
We are currently using Route 53 as the external DNS provider to sync DNS records for jobs that are created and deleted. As a result, every job creation or deletion triggers fresh Route 53 API calls.
Also referring to this readme. https://github.com/kubernetes-sigs/external-dns/blob/master/docs/advanced/rate-limits.md

In particular, it has been found that with 200k records in an AWS Route53 zone, each refresh triggers around 70 API calls to retrieve all the records, making it more likely to hit the AWS Route53 API rate limits.

If the page size is 300, shouldn't the number of calls be 200k/300 = 0.67k not 0.07k.

Also since we have fresh sync of external provider at regular intervals, shouldn't the cache stay in sync ? It can drift maximium by provider-cache-time.

ivankatliarchuk · 2026-05-19T22:52:32Z

Improving things always welcome. I'll play devils advocate, and will leave it for other reviewers to decide, as don't have strong opinion around this feature. I have not looked at the code, as CLA not signed, most likely there are some edge-cases that needs considering as well.

The controller pattern has one job: observe actual state → compare to desired state → converge

Caching "actual DNS state" means we're not observing - we're trusting a snapshot. The cache breaks the core guarantee the controller exists to provide: that DNS reality matches Kubernetes intent.

The record cache is compensating for one problem: reading a large zone is expensive. That's an API interaction problem. In such case, I'm not sure what the right solution is. For example I'm moved all large records to coredns provider as an example.

A TTL-based cache of managed records is a pragmatic shortcut that trades DNS correctness for cost savings (depends how risky is this correctness). It works until it doesn't - and when it fails silently, DNS is wrong -> as a result it could lead to disaster.

k8s-ci-robot · 2026-06-14T18:02:57Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot requested a review from mloiseleur May 19, 2026 15:32

k8s-ci-robot added the apis Issues or PRs related to API change label May 19, 2026

k8s-ci-robot requested a review from vflaux May 19, 2026 15:32

k8s-ci-robot added the controller Issues or PRs related to the controller label May 19, 2026

k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. provider Issues or PRs related to a provider labels May 19, 2026

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 19, 2026

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 19, 2026

tanishcharaya force-pushed the provider-cache-patch-on-apply branch from 53fd999 to 5cfc0c3 Compare May 19, 2026 15:36

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 19, 2026

add --provider-cache-patch-on-apply flag to avoid full re-fetch on la…

a441085

…rge zones

tanishcharaya force-pushed the provider-cache-patch-on-apply branch from 5cfc0c3 to a441085 Compare May 19, 2026 15:43

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add --provider-cache-patch-on-apply flag to avoid full re-fetch on large zones#6447

feat: add --provider-cache-patch-on-apply flag to avoid full re-fetch on large zones#6447
tanishcharaya wants to merge 1 commit into
kubernetes-sigs:masterfrom
tanishcharaya:provider-cache-patch-on-apply

tanishcharaya commented May 19, 2026 •

edited

Loading

Uh oh!

k8s-ci-robot commented May 19, 2026

Uh oh!

linux-foundation-easycla Bot commented May 19, 2026 •

edited

Loading

Uh oh!

k8s-ci-robot commented May 19, 2026

Uh oh!

k8s-ci-robot commented May 19, 2026

Uh oh!

ivankatliarchuk commented May 19, 2026

Uh oh!

tanishcharaya commented May 19, 2026 •

edited

Loading

Uh oh!

ivankatliarchuk commented May 19, 2026

Uh oh!

k8s-ci-robot commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tanishcharaya commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Why is this needed?

How does it work?

Usage

Uh oh!

k8s-ci-robot commented May 19, 2026

Uh oh!

linux-foundation-easycla Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented May 19, 2026

Uh oh!

k8s-ci-robot commented May 19, 2026

Uh oh!

ivankatliarchuk commented May 19, 2026

Uh oh!

tanishcharaya commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ivankatliarchuk commented May 19, 2026

Uh oh!

k8s-ci-robot commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tanishcharaya commented May 19, 2026 •

edited

Loading

linux-foundation-easycla Bot commented May 19, 2026 •

edited

Loading

tanishcharaya commented May 19, 2026 •

edited

Loading