Round-robin: reuse a sibling-IP HTTP/2 connection when the per-IP connection is permit-starved #2214

Description

@pavel-ptashyts

Context

Follow-up from #2202 (LoadBalance.ROUND_ROBIN), raised in this review comment.

In round-robin mode each request is pinned to a resolved IP via RoundRobinPartitionKey(base, IP), but the connection permit is still per host (maxConnectionsPerHost, keyed by the base host key). These two facts interact badly under a configured connection cap.
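The mismatch between the two scopes can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the library's code: `PermitModel`, its methods, and the string keys are hypothetical, and a plain `Semaphore` stands in for whatever mechanism enforces maxConnectionsPerHost.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical model: requests are pinned per IP, but permits are counted per host.
final class PermitModel {
    private final int maxConnectionsPerHost;
    // One semaphore per base host key; the resolved IP is not part of the key.
    private final Map<String, Semaphore> permitsByBaseKey = new ConcurrentHashMap<>();

    PermitModel(int maxConnectionsPerHost) {
        this.maxConnectionsPerHost = maxConnectionsPerHost;
    }

    // The ip argument is deliberately unused: the cap applies to the host as a whole.
    boolean tryAcquire(String baseKey, String ip) {
        return permitsByBaseKey
                .computeIfAbsent(baseKey, k -> new Semaphore(maxConnectionsPerHost))
                .tryAcquire();
    }

    // Returns one permit to the host's pool, if the host has been seen before.
    void release(String baseKey) {
        Semaphore permits = permitsByBaseKey.get(baseKey);
        if (permits != null) {
            permits.release();
        }
    }
}
```

With a cap of 1, a request pinned to one IP takes the host's only permit, and a request pinned to a different IP of the same host is refused until that permit is released.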

Problem

  • Request A opens an HTTP/2 connection to IP_A. It is registered in the H2 registry under its per-IP key (base, IP_A): NettyConnectListener calls registerHttp2Connection(future.getPartitionKey(), …), and getPartitionKey() returns the per-IP override.
  • Request B is pinned to IP_B. The host is already at maxConnectionsPerHost, so acquirePartitionLockLazily() fails and B enters waitForHttp2Connection.
  • B polls the registry with its own key (base, IP_B) and finds nothing — A's connection lives only under (base, IP_A).
  • B can neither open a new connection (host permit exhausted) nor reuse A's sibling connection. When the request is issued off the event loop, B spins in waitForHttp2Connection for the full connectTimeout and then fails with the original permit exception: a stall followed by a failure.

This only bites when maxConnectionsPerHost is configured; the default is unlimited, so most users never hit it. That is why #2202 ships an accurate doc as the stopgap rather than a behavioral change.

Why the obvious fix does not work

Falling back to a poll on the per-host base key is a no-op: the H2 registry (ChannelManager.http2Connections) is an exact-key ConcurrentHashMap, and nothing is ever registered under the base key in round-robin mode. A one-line "poll the base key instead" swap would compile and change nothing.
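Why both lookups miss can be shown with a simplified model of an exact-key map. This is a sketch only: `ExactKeyRegistryModel` and `PerIpKey` are hypothetical stand-ins for the registry and for RoundRobinPartitionKey, and connections are represented as plain strings. It uses a record, so it assumes Java 16 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of an exact-key connection registry.
final class ExactKeyRegistryModel {
    // Stand-in for a per-IP partition key; a record gives value-based equals/hashCode.
    record PerIpKey(String base, String ip) {}

    // A key is either a base key (String) or a PerIpKey; lookups match by equals() only.
    private final Map<Object, String> connections = new ConcurrentHashMap<>();

    void register(Object key, String connection) {
        connections.put(key, connection);
    }

    // Returns the connection stored under exactly this key, or null if there is none.
    String poll(Object key) {
        return connections.get(key);
    }
}
```

If the only registration is under `new PerIpKey(base, ipA)`, then polling with `new PerIpKey(base, ipB)` returns null, and so does polling with `base` itself, because neither key is equal to the one that was registered.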

Real fix

For B to reuse A's connection, the registry must let a request find any open H2 connection for the host across its per-IP keys — i.e. index the H2 registry by base key (or scan sibling round-robin keys for the same base). This is a genuine change to registry indexing, not a trivial poll swap, and should be scoped/reviewed as such.
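One possible shape for a base-key index is sketched below. It is an illustration of the idea, not a proposed patch: `BaseIndexedRegistryModel` and all of its methods are hypothetical, connections are plain strings, and the sketch ignores questions a real change would need to answer, such as connection liveness and per-connection stream limits.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry indexed by base key first, then by resolved IP.
final class BaseIndexedRegistryModel {
    // base host key -> (resolved IP -> connection)
    private final Map<String, Map<String, String>> byBase = new ConcurrentHashMap<>();

    void register(String base, String ip, String connection) {
        byBase.computeIfAbsent(base, k -> new ConcurrentHashMap<>()).put(ip, connection);
    }

    void unregister(String base, String ip) {
        Map<String, String> perIp = byBase.get(base);
        if (perIp != null) {
            perIp.remove(ip);
        }
    }

    // Prefer the pinned IP; otherwise fall back to any sibling connection for the same host.
    Optional<String> pollPreferring(String base, String ip) {
        Map<String, String> perIp = byBase.get(base);
        if (perIp == null) {
            return Optional.empty();
        }
        String exact = perIp.get(ip);
        if (exact != null) {
            return Optional.of(exact);
        }
        return perIp.values().stream().findFirst();
    }
}
```

In this model a request pinned to IP_B still gets its own connection when one exists, and otherwise finds A's connection through the shared base key instead of missing the registry entirely.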

Notes

  • The stall-then-fail on connectTimeout (off the event loop) is worth covering in any fix or test.
  • Active health checks / failed-IP handling are tracked separately.
