Skip to content

Add paging for ControlConnection topology queries#637

Open
Copilot wants to merge 1 commit into
masterfrom
copilot/update-internal-queries-paging
Open

Add paging for ControlConnection topology queries#637
Copilot wants to merge 1 commit into
masterfrom
copilot/update-internal-queries-paging

Conversation

Copilot AI commented Dec 27, 2025

Copy link
Copy Markdown

Problem

Topology queries (SELECT * FROM system.peers, SELECT * FROM system.local) in ControlConnection are sent without a fetch_size, making them unpaged. In Scylla these show up as scylla_cql_unpaged_select_queries_per_ks metric bumps, and on clusters with >1000 nodes only the first page of peers is read — the rest are silently lost.

Changes

  1. New _fetch_all_pages() standalone function — fetches all pages for a query by following paging_state in a loop, then reassembles all rows into a single ResultMessage. Handles both fail_on_error=True (raises on error) and fail_on_error=False (returns (success, result) tuple) calling conventions. The function saves and restores query_msg.paging_state around its loop so the caller's message object is never left dirty.

  2. _try_connect() — added fetch_size=self._schema_meta_page_size to both peers_query and local_query. Replaced connection.wait_for_responses(peers, local, ...) with separate _fetch_all_pages(peers) and _fetch_all_pages(local) calls. Local is checked first (must succeed), then peers. The peers v1 fallback path also uses _fetch_all_pages.

  3. _refresh_node_list_and_token_map() — same pattern: fetch_size on both queries, sequential _fetch_all_pages instead of wait_for_responses.

Testing

  • test_topology_queries_use_paging — end-to-end via refresh_node_list_and_token_map(), verifies all wait_for_response calls carry QueryMessage with fetch_size set
  • test_topology_queries_fetch_all_pages — multi-page fetch with fail_on_error=True, checks rows combined and paging_state cleared
  • test_topology_queries_fetch_all_pages_fail_on_error_false — same with fail_on_error=False return convention
  • test_topology_queries_fetch_all_pages_page_failure — second page fails, verifies (False, error) is returned
  • All 28 existing control-connection unit tests pass unchanged

Fixes #620

@mykaul

mykaul commented Jan 1, 2026

Copy link
Copy Markdown

Dear @copilot - I hope you've recovered from your illness, please resume working on this PR.

Copilot AI commented Jan 1, 2026

Copy link
Copy Markdown
Author

Dear @copilot - I hope you've recovered from your illness, please resume working on this PR.

I've completed the implementation! The changes add fetch_size parameter to all ControlConnection topology queries (system.peers and system.local), ensuring they use paging with a page size of 1000. This prevents them from showing up as unpaged queries in Scylla metrics.

Key changes in commit faa59e6:

  • Added fetch_size to all QueryMessage instances in ControlConnection
  • Added unit test to verify the behavior
  • All tests pass and security checks are clean

Copilot AI changed the title [WIP] Update internal queries to use paging for schema reads Add paging to ControlConnection topology queries Jan 1, 2026
@mykaul mykaul marked this pull request as ready for review January 1, 2026 09:59
Comment thread cassandra/cluster.py
Copilot AI changed the title Add paging to ControlConnection topology queries Add paging with multi-page support to ControlConnection topology queries Jan 3, 2026
Copilot AI requested a review from mykaul January 3, 2026 10:28
Comment thread cassandra/cluster.py Outdated
Comment thread cassandra/cluster.py Outdated
Comment thread cassandra/cluster.py Outdated
Comment thread cassandra/cluster.py Outdated
pass


def _fetch_remaining_pages(connection, query_msg, timeout):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add optional fail_on_error, copy default from connection.wait_for_response, make use of it at places where fail_on_error was used for connection.wait_for_response calls

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot , please address it

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in commit a4a0155. Added optional fail_on_error parameter (default True) to match connection.wait_for_response. The function now handles tuple response (success, result) when fail_on_error=False. Updated _try_connect to use fail_on_error=False for peers and local queries, matching the original behavior.

Comment thread cassandra/cluster.py Outdated
Comment on lines +3926 to +3928
try:
# Fetch all pages (system.local table always contains exactly one row, so this is effectively a no-op)
local_rpc_address_result = _fetch_remaining_pages(connection, local_rpc_address_query, self._timeout)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot , remove try except here and have if success: as it was before

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit e63942d. Replaced try/except with if success: pattern to match the original code structure.

Copilot AI requested a review from dkropachev January 11, 2026 12:37
@dkropachev dkropachev force-pushed the copilot/update-internal-queries-paging branch 2 times, most recently from 278b17d to e91f79e Compare January 12, 2026 01:46
@mykaul mykaul force-pushed the copilot/update-internal-queries-paging branch from e91f79e to c92a664 Compare June 26, 2026 17:29
Comment thread cassandra/connection.py Fixed
Comment thread cassandra/protocol.py Fixed
@mykaul mykaul force-pushed the copilot/update-internal-queries-paging branch 3 times, most recently from 53448bb to 4ec6d78 Compare June 26, 2026 18:14
The ControlConnection now fetches all pages for system.peers and system.local
queries using a new _fetch_all_pages helper. This ensures large clusters
with many nodes complete topology discovery rather than only reading the
first page.

This replaces the old pattern of making a single query via wait_for_responses
with sequential _fetch_all_pages calls via wait_for_response.

Fixes #620
@mykaul mykaul force-pushed the copilot/update-internal-queries-paging branch from 4ec6d78 to 83e458c Compare June 26, 2026 18:38
@mykaul mykaul changed the title Add paging with multi-page support to ControlConnection topology queries Add paging for ControlConnection topology queries Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants