Skip to content

fix: Prevent orphaning records due to label lengths#6431

Open
mmiller-sh wants to merge 7 commits into
kubernetes-sigs:masterfrom
mmiller-sh:fix/orphaned-records
Open

fix: Prevent orphaning records due to label lengths#6431
mmiller-sh wants to merge 7 commits into
kubernetes-sigs:masterfrom
mmiller-sh:fix/orphaned-records

Conversation

@mmiller-sh

@mmiller-sh mmiller-sh commented May 11, 2026

Copy link
Copy Markdown

What does it do ?

This prevents the controller from creating A/AAAA/CNAME records when a corresponding ownership TXT record is unable to be created for the domain.

In cases where the A/AAAA/CNAME record is under the 63 character DNS label length limit, but the generated TXT record label (with aaaa- or cname- prefixes) exceeds the limit, these records will become orphaned by the controller given lack of established ownership.

Motivation

Fixes: #6430

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@k8s-ci-robot k8s-ci-robot added the registry Issues or PRs related to a registry label May 11, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 11, 2026

Copy link
Copy Markdown

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: mmiller-sh / name: Matt Miller (c9b1efb)

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Welcome @mmiller-sh!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Hi @mmiller-sh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 11, 2026
@mmiller-sh mmiller-sh force-pushed the fix/orphaned-records branch from 755bdef to c9b1efb Compare May 11, 2026 12:19
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 11, 2026
@mmiller-sh mmiller-sh marked this pull request as ready for review May 11, 2026 12:19
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@k8s-ci-robot k8s-ci-robot requested a review from u-kai May 11, 2026 12:19

@u-kai u-kai left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!
Could you take a look at my comment?

Comment thread registry/txt/registry.go Outdated
Comment on lines +352 to +356
txts := im.generateTXTRecord(r)
if len(txts) == 0 {
log.Warnf("Skipping create of %s %s: cannot establish ownership TXT (label exceeds RFC 1035 63-char limit)", r.RecordType, r.DNSName)
continue
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understand correct that your fix doesn't bypass the string limit, but simply output log?

@mmiller-sh mmiller-sh May 11, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It prevents the A/AAAA/CNAME records from being created (and orphaned) in the first place when the TXT record names cross the string limit, as well as logging the event.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, this change breaks backward compatibility and feels more like a workaround.
I would prefer changing the TXT registry format when the record lenght limit is reached.

What do you think?

@mmiller-sh mmiller-sh May 12, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a workaround. I believe moving the record type into it's own label when it would overflow, ie: TXT cname.label1.label2.com should fix the immediate issue. Let me look into what might be needed to safely implement this in a non-breaking way.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@u-kai I created a new PR with the alternate approach: #6436

Thanks!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, this change breaks backward compatibility and feels more like a workaround. I would prefer changing the TXT registry format when the record lenght limit is reached.

What do you think?

I have not read that. Actually my view is opposite #6436 (comment). Tool should not change the registry format dynamically. This is an operational hell.

Frankly speaking, the user should have a convention, and manage zone naming. When the DNS hits 50 characters, it's already a signal that it's a time to create a new label/zone

@mmiller-sh

Copy link
Copy Markdown
Author

Reopening based on further feedback on the alternate approach. #6436 (comment)

@mmiller-sh mmiller-sh reopened this May 15, 2026
@mloiseleur

Copy link
Copy Markdown
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 16, 2026
@coveralls

coveralls commented May 16, 2026

Copy link
Copy Markdown

Coverage Report for CI Build 27476544493

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-0.05%) to 80.576%

Details

  • Coverage decreased (-0.05%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • 763 coverage regressions across 32 files.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

763 previously-covered lines in 32 files lost coverage.

Top 10 Files by Coverage Loss Lines Losing Coverage Coverage
traefik_proxy.go 125 67.93%
gateway.go 77 86.9%
kong_tcpingress.go 53 50.0%
gloo_proxy.go 51 74.37%
testutils/endpoint.go 35 56.88%
f5_transportserver.go 33 78.0%
f5_virtualserver.go 33 78.57%
awssd/aws_sd.go 32 84.27%
skipper_routegroup.go 30 35.4%
integration/toolkit/toolkit.go 29 82.23%

Coverage Stats

Coverage Status
Relevant Lines: 21818
Covered Lines: 17580
Line Coverage: 80.58%
Coverage Strength: 1429.36 hits per line

💛 - Coveralls

@ivankatliarchuk ivankatliarchuk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change may take a while, as it changes behaviour regardless of how we are going to implement it.

Most likely we should consider a new flag, something like.

--strict-label-compliance
--strict-dns-compliance

Depends what other think about that. I know internally we had a discussion while back, to find ways to reduce number of flags. So No flag, no config, always enforced.o flag, no config, always enforced is an option too.

A new flag implies this is optional or user-configurable behavior. But if ExternalDNS can't establish ownership it most likely should never silently create the record - that's a correctness guarantee, not a preference.

Comment thread registry/txt/registry.go Outdated
Comment on lines +344 to +363
for _, r := range pendingCreate {
if r.Labels == nil {
r.Labels = make(map[string]string)
}
r.Labels[endpoint.OwnerLabelKey] = im.ownerID

filteredChanges.Create = append(filteredChanges.Create, im.generateTXTRecordWithFilter(r, im.existingTXTs.isAbsent)...)
// Skip records whose ownership TXT cannot be established; creating
// them would leak an unreclaimable record into the zone.
txts := im.generateTXTRecord(r)
if len(txts) == 0 {
log.Warnf("Skipping create of %s %s: cannot establish ownership TXT (label exceeds RFC 1035 63-char limit)", r.RecordType, r.DNSName)
continue
}

filteredChanges.Create = append(filteredChanges.Create, r)
for _, txt := range txts {
if im.existingTXTs.isAbsent(txt) {
filteredChanges.Create = append(filteredChanges.Create, txt)
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change manually re-inlines the isAbsent filter after splitting from generateTXTRecordWithFilter. This is fragile if the helper ever gains new behavior. Not sure if it should be even here

Comment thread registry/txt/registry.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

Create: endpoint.FilterEndpointsByDNSCompliance(
          func(ep *endpoint.Endpoint) string { return im.mapper.ToTXTName(ep.DNSName, ep.RecordType) },
          changes.Create,
      ),

and

// FilterEndpointsByDNSCompliance filters out endpoints for which the name derived by
  // nameFunc violates RFC 1035 label length constraints (labels must be ≤ 63 characters).
  func FilterEndpointsByDNSCompliance(nameFunc func(*Endpoint) string, eps []*Endpoint) []*Endpoint {
      filtered := make([]*Endpoint, 0, len(eps))
      for _, ep := range eps {
          valid := true
          for label := range strings.SplitSeq(nameFunc(ep), ".") {
              if len(label) > 63 {
                  log.Warnf("Skipping endpoint %s %s: label %q exceeds RFC 1035 63-char limit", ep.RecordType, ep.DNSName, label)
                  valid = false
                  break
              }
          }
          if valid {
              filtered = append(filtered, ep)
          }
      }
      return filtered
  }

To be consistent with the rest

@mloiseleur

Copy link
Copy Markdown
Collaborator

we should consider a new flag [...] always enforced is an option too.

🤔 As you said, @ivankatliarchuk

A user hitting the limit had a label that was already 57+ chars, which is already pushing DNS convention well past reasonable hostname lengths.

So if we were to introduce a new flag, that would be to support this corner case between 57 & 63 ?
Is it really worth it when this usage is already past reasonable hostname lengths ?

It's a common and known pattern to fail gracefully when the user is asking the software to execute unreachable tasks.

=> So to me, this should be resolved with no new flags and a breaking chance with always enforced behavior where it does not create the record and log a clear message when txt registry is enabled and the domain is 57+ chars.

@u-kai Wdyt ? Is there anything missed with this approach?

@ivankatliarchuk

Copy link
Copy Markdown
Member

"It's a common and known pattern to fail gracefully when the user is asking the software to execute unreachable tasks"

This is a tricky one. Not creating an orphaned record is most likely better than creating one. That's the right call - a record with no ownership TXT is worse than no record at all.

The task isn't "unreachable." The A/AAAA/CNAME record itself is perfectly valid — the label is under 63 chars, and created without any problem. What's unreachable is establishing ownership via the TXT registry mechanism. The constraint is an internal implementation detail of external-dns.

"fail gracefully" usually implies the failure is clearly communicated. A log.Warnf buried in controller output is not graceful from a user perspective - the Ingress has no DNS record and use must to grep logs to find out why. A truly graceful failure in a Kubernetes controller context would surface a condition or event on the object with metric, a kubernetes event, or at minimum a log.Errorf. Silent skips with warn-level logs have caused production incidents in controllers before and will cause it most likely in the future.

I'm not against changing the behaviour, If we all think that orphaned records are worse than missing ones, we could have no flag for that.

I've composed some pros/cons for behavior

Orphaned DNS record (current behavior)

Pros:

  • Traffic actually reaches the destination — no service disruption at creation time

Cons:

  • external-dns can never reclaim it; it lives forever in the zone even after the Ingress/Service is deleted
  • Future reconciles see a record with no owner TXT → treat it as foreign → won't touch it
  • Zone accumulates drift; manual cleanup required
  • Could mask the label-length problem entirely - appears to "work" until deletion

No record at all (this PR)

Pros:

  • Zone stays clean, no unmanaged records accumulate
  • Failure is detectable (log, missing DNS, metric, event)
  • Controller remains authoritative -> what it creates, it can delete

Cons:

  • Service is unreachable immediately — real user-facing impact
  • Failure is only a Warnf in controller logs — easy to miss
  • User gets no signal on the Kubernetes resource itself (no event, no condition)

Orphaned record is the worse long-term outcome - it's hidden, permanent, and requires manual intervention (not great for controller). Missing record is painful immediately but at least it's honest and recoverable (fix the name, DNS appears)

The catch is that "no record" is only acceptable if the failure is loud enough to act on. Right now the log level and lack of a Kubernetes event and metric make it too quiet for the severity of the consequence (broken service). That's the gap worth be clear about.

@mmiller-sh

mmiller-sh commented May 16, 2026

Copy link
Copy Markdown
Author

Noting that the controller currently logs the naming violation when it attempts to reconcile the invalid TXT records, it does not log anything about orphaned records though.

What would you think about increasing log level to error when a record would have been orphaned and adding a Prometheus counter that increments on skips? I'm happy to make both of these changes to make it easier to surface failures involving this DNS spec nuance. I have mild hesitations around adding a metric specific to this corner case/bug fix, but can rationalize it to myself given it would be a single counter. Let me know if that seems reasonable, thank you!

@ivankatliarchuk

Copy link
Copy Markdown
Member

I'm just sharing thoughts and ideas at the moment

@u-kai

u-kai commented May 17, 2026

Copy link
Copy Markdown
Member

=> So to me, this should be resolved with no new flags and a breaking chance with always enforced behavior where it does not create the record and log a clear message when txt registry is enabled and the domain is 57+ chars.

I agree with this approach for simplicity.

@mmiller-sh
Sorry for the extra back-and-forth from my earlier suggestion.

I can see the concern about dynamically changing the registry format now. Keeping the behavior simple and narrowing the supported scope in ExternalDNS instead of introducing more operational complexity sounds like the better approach to me as well.

Address review feedback on kubernetes-sigs#6431:

- Move the label-length check into endpoint.FilterEndpointsByDNSCompliance
  so the registry no longer manually re-inlines the isAbsent filter.
- Log skips at error level — Warnf was easy to miss in controller output.
- Track skips with the external_dns_registry_skipped_records_label_too_long_total
  counter, labeled by record_type and apex domain.
- Document the limit and its observability in docs/registry/txt.md.

Signed-off-by: Matt Miller <matt@mattmiller.io>
@k8s-ci-robot k8s-ci-robot added docs internal Issues or PRs related to internal code labels May 22, 2026
@mmiller-sh

mmiller-sh commented May 22, 2026

Copy link
Copy Markdown
Author

Apologies for the delay getting back to this. I attempted to address feedback with the following:

  • Refactor some code that inlined the absent filter using suggested new helper (I think I addressed this, let me know if I missed the mark!)
  • Raises the log level to error for the condition
  • Adds a metric people could use to be alerted on for the condition
  • docs for the new guard/behavior

Please let me know if I missed anything!

@u-kai u-kai left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just my personal opinions, so please feel free to take them as you see fit. Sorry again for the late feedback!

Comment thread docs/registry/txt.md Outdated
extends that label.

If any label in the projected TXT name exceeds 63 characters, external-dns skips both
the parent record and its ownership TXT. Creating the parent without a TXT would leave

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term parent is used here, but owned record might be a better fit — it maps directly to the ownedRecord label key in endpoint/labels.go and reflects how the registry already models this relationship.

Comment thread endpoint/endpoint.go Outdated
// Such records cannot be owned by the registry and would orphan in the zone.
// onSkip, if non-nil, is invoked per drop with the offending label so callers
// can attach metrics or events without importing this package.
func FilterEndpointsByDNSCompliance(toRegistryName func(*Endpoint) string, eps []*Endpoint, onSkip func(skipped *Endpoint, badLabel string)) []*Endpoint {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late comment — sharing my personal take on the current design. 🙇

FilterEndpointsByDNSCompliance takes a toRegistryName func(*Endpoint) string parameter, which is essentially TXT registry naming logic injected into the endpoint package. I find this
harder to follow than necessary, and toRegistryName doesn't feel like a natural responsibility of the endpoint package.

Also, since NewEndpoint already validates label length and returns nil for violations, filtering endpoints before calling it feels redundant to me — the check ends up happening in two
places conceptually.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about an alternative like this instead?


endpoint/endpoint.go — expose the check as a public utility:

// OverflowingLabel returns the first label in name exceeding RFC 1035's 63-char limit.
func OverflowingLabel(name string) (string, bool) {
    for label := range strings.SplitSeq(name, ".") {
        if len(label) > 63 {
            return label, true
        }
    }
    return "", false
}

registry/txt/registry.go — validate in generateTXTRecord explicitly, removing generateTXTRecordWithFilter:

func (im *TXTRegistry) generateTXTRecord(r *endpoint.Endpoint) ([]*endpoint.Endpoint, error) {
    recordType := r.RecordType
    if shouldUseCNAMEForTxtRecord(r) {
        recordType = endpoint.RecordTypeCNAME
    }
    if im.oldOwnerID != "" && r.Labels[endpoint.OwnerLabelKey] == im.oldOwnerID {
        r.Labels[endpoint.OwnerLabelKey] = im.ownerID
    }

    txtName := im.mapper.ToTXTName(r.DNSName, recordType)
    if label, ok := endpoint.OverflowingLabel(txtName); ok {
        return nil, fmt.Errorf("%s %s: projected TXT name %q has label %q exceeding RFC 1035's 63-char limit", r.RecordType, r.DNSName, txtName, label)
    }

    txtNew := endpoint.NewEndpoint(txtName, endpoint.RecordTypeTXT, r.Labels.Serialize(true, im.txtEncryptEnabled, im.txtEncryptAESKey))
    txtNew.WithSetIdentifier(r.SetIdentifier)
    txtNew.Labels[endpoint.OwnedRecordLabelKey] = r.DNSName
    txtNew.ProviderSpecific = r.ProviderSpecific

    return []*endpoint.Endpoint{txtNew}, nil
}

registry/txt/registry.go — Create loop in ApplyChanges, with isAbsent made explicit at the call site:

filteredChanges := &plan.Changes{
    Create:    make([]*endpoint.Endpoint, 0, len(changes.Create)),
    UpdateNew: endpoint.FilterEndpointsByOwnerID(im.ownerID, changes.UpdateNew),
    UpdateOld: endpoint.FilterEndpointsByOwnerID(im.ownerID, changes.UpdateOld),
    Delete:    endpoint.FilterEndpointsByOwnerID(im.ownerID, changes.Delete),
}

for _, r := range changes.Create {
    if r.Labels == nil {
        r.Labels = make(map[string]string)
    }
    r.Labels[endpoint.OwnerLabelKey] = im.ownerID

    txts, err := im.generateTXTRecord(r)
    if err != nil {
        log.Errorf("Skipping owned record: %v; record would become unmanageable", err)
        recordSkippedLabelTooLong(r)
        continue
    }
    filteredChanges.Create = append(filteredChanges.Create, r)
    for _, txt := range txts {
        if im.existingTXTs.isAbsent(txt) {
            filteredChanges.Create = append(filteredChanges.Create, txt)
        }
    }
    if im.cacheInterval > 0 {
        im.addToCache(r)
    }
}

This keeps the label length check and TXT name generation in one place (generateTXTRecord), avoids leaking registry-specific logic into the endpoint package, and makes isAbsent explicit at the call site.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, if NewEndpoint returned an error instead of nil when endpoint label length overflow, things could be even simpler — but given the broad impact of that change, I'd suggest the approach above as a practical middle ground

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @u-kai, I believe I've addressed your feedback with the most recent changes. Can you let me know what you think?

Signed-off-by: Matt Miller <matt@mattmiller.io>
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2026

@u-kai u-kai left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fixes.

Comment thread endpoint/endpoint.go Outdated
Comment thread registry/txt/registry.go Outdated
Comment on lines +407 to +409
if err != nil {
log.Errorf("Skipping TXT companion for updateNew: %v", err)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generateTXTRecord can never return an error in the Update and Delete paths.
Records in both paths were already recognized as owned Record() method, which means their TXT names were valid at creation time.
Since updates never change the DNS name or record type, only change targets or labels, the same TXT name is recognized and cannot overflow.

IMO, we should drop the error catches.

That's said, I might missing something, please collect me if I'm wrong.

Wdyt??

@mmiller-sh mmiller-sh Jun 7, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My perception is that this error checking is pretty harmless and could potentially be helpful if the registry behavior changes in the future. That said, I'll remove the error check to keep the code cleaner as it pertains to current behavior.

@u-kai u-kai left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating this.
I left a small comment.

Comment thread registry/txt/registry.go
if err != nil {
log.Errorf("Skipping TXT companion for updateOld: %v", err)
}
txts, _ := im.generateTXTRecord(r)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO

Other comments in this file are already duplicated across similar paths, so it would be consistent to add the same "unreachable" explanation to UpdateOld and UpdateNew, not just Delete.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated!

Comment thread registry/txt/registry.go Outdated
Comment on lines +281 to +283
if err != nil {
log.Warnf("Skipping migration check: %v", err)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmiller-sh
IMO, this error is unreachable for the same reason as the _ discards in the Delete/UpdateOld/UpdateNew paths — any ep reaching here already had its TXT name successfully created, so the label-length check can never fire.

Comment thread registry/txt/registry_test.go Outdated
txtPrefix string
newParent func(dnsName string) *endpoint.Endpoint
labelLen int
txtPrefx string // the prefix the mapper prepends, used to assert the TXT name is not in the change set

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a typo?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rename the var to be more clear of it's usage.

Comment thread registry/txt/registry.go Outdated

txtNew := endpoint.NewEndpoint(txtName, endpoint.RecordTypeTXT, r.Labels.Serialize(true, im.txtEncryptEnabled, im.txtEncryptAESKey))
if txtNew == nil {
return nil, nil

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not covered

Image

Comment thread registry/txt/metrics.go Outdated
prometheus.GaugeOpts{
Subsystem: "registry",
Name: "skipped_records_label_too_long_per_sync",
Help: "Number of records skipped because the projected TXT registry name has a DNS label exceeding RFC 1035's 63-char limit, for each record type and domain (vector).",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the metric name describes the wrong thing. label_too_long describes a property of the original record's label - but that's not what's overflowing. It's the projected TXT name's label that overflows, after the cname-/aaaa-/--txt-prefix is prepended. A record with a 60-char label is perfectly valid and will be skipped only because cname- + 60 = 66 > 63. An operator seeing label_too_long might think their source record is invalid, not that the TXT prefix is the problem.

Maybe something like

registry_skipped_records_label_overflow_per_sync{record_type, domain, overflow_in}

with overflow_in taking two values:

  • "source" — the original DNS name's label is > 63 chars (caught in NewEndpoint)
  • "txt_prefix" — the projected TXT name's label overflows after the record-type prefix and/or --txt-prefix/--txt-suffix is applied

label_overflow rather than label_too_long — "overflow" is the RFC 1035 term and is directional (it implies something caused it to exceed the limit, not that the label was inherently wrong)

Help text: "Number of records skipped per sync because a DNS label exceeds RFC 1035's 63-char limit, by record type, apex domain, and whether the overflow is in the source name or the projected TXT registry name."

@mmiller-sh mmiller-sh Jun 13, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not covered

I believe this check is no longer necessary due to the suggested OverflowingLabel function that was added to check for correctness before calling NewEndpoint per review. I'm going to drop this check.

And the metric name describes the wrong thing.

I agree, skipped_records_label_overflow_per_sync is more descriptive of the actual problem.

One implementation detail I want to call out here while wiring the "source" overflow_in metric label, it initially seemed more natural to me to move the metric registration to the endpoint package, but there are cyclical dependencies that result from doing so. The smallest change set I could come up with to avoid refactoring the metrics package was to keep the metric registration in registry, and wire up a callback function to this from endpoint. Please let me know if this approach seems appropriate, or if you have other suggestions (I couldn't easily find precedent for this type of plumbing in the code base). I also considered having NewEndpoint return error, or overflowing labels so we could detect this in the registry package, but changing the interface seemed like a deal breaker to me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, I thought about moving metric, but not found a way yet. Most likely we could do it in follow-up at some point.

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ivankatliarchuk. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ivankatliarchuk ivankatliarchuk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like adding metrics create some design tensions. Most likely worth to drop metrics from this pull request. Add initial fix, and if you have interest, in follo-up pull request you could try to add a metric.

With current design, there is a bit of struggling with data flow, global callbacks and etc

Comment thread endpoint/endpoint.go

// parentDomain returns the parent domain of dnsName (without the first label),
// or dnsName unchanged when it is an apex/two-label name.
func parentDomain(dnsName string) string {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate

func (e *Endpoint) GetNakedDomain() string {

Comment thread endpoint/endpoint.go
}

// OverflowingLabel returns the first label in name longer than 63 chars.
func OverflowingLabel(name string) (string, bool) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is grammatically awkward though; Go convention favors either HasOverflowingLabel(name) bool for a predicate, or FindOverflowingLabel if you need the value.

Not sure if the file location is correct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. docs internal Issues or PRs related to internal code ok-to-test Indicates a non-member PR verified by an org member that is safe to test. registry Issues or PRs related to a registry size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

registry/txt: parent A/AAAA created without TXT ownership companion when registry name overflows 63 chars

6 participants