Skip to content

fix: make V1Channel re-subscribable after a failed subscribe#845

Merged
allenporter merged 6 commits into
Python-roborock:mainfrom
pike00:fix/v1-channel-resubscribe
Jun 22, 2026
Merged

fix: make V1Channel re-subscribable after a failed subscribe#845
allenporter merged 6 commits into
Python-roborock:mainfrom
pike00:fix/v1-channel-resubscribe

Conversation

@pike00

@pike00 pike00 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

A device whose first connect attempt hits a transient error is bricked for the rest of the session: it never gets entities and the log shows the misleading ValueError: Only one subscription allowed at a time. This was observed with two V1 vacuums on one account — the second vacuum (sharing the MQTT session) reliably failed while the first connected fine.

Root cause

The connect path is RoborockDevice.start_connect()connect_loop()connect()V1Channel.subscribe().

  1. V1Channel.subscribe() set self._callback = callback last, and the returned unsub() closure tore down the reconnect task and MQTT/local subscriptions but never cleared self._callback.
  2. In connect(), if v1_properties.start() raised a transient RoborockException (e.g. the initial status fetch timing out for a second device contending for the MQTT session), the except RoborockException block called unsub() and re-raised — leaving self._callback still set.
  3. connect_loop() caught the RoborockException and retried connect(). The retry's subscribe() saw self._callback is not None and raised ValueError("Only one subscription allowed at a time"). That ValueError is not a RoborockException, so it escaped to connect_loop()'s outer except Exception, which logs Uncaught error during connect and returns — permanently giving up. No ready callbacks fire, so no entities are created.

A secondary leak: if subscribe() raised partway through (after starting _reconnect_task / setting _mqtt_unsub), it returned no unsub, so the reconnect task and subscriptions leaked on every retry.

Fix

  • V1Channel.subscribe() now claims self._callback up front and wraps the local-connect / reconnect-task / MQTT-subscribe setup in try/except BaseException: self._teardown(); raise.
  • The teardown logic is factored into an idempotent _teardown() (also returned as the unsub function) that cancels the reconnect task, drops the MQTT/local subscriptions, closes the local channel, and clears self._callback — leaving the channel cleanly re-subscribable.
  • RoborockDevice.connect()'s cleanup is broadened from except RoborockException to except BaseException so any start() failure (not just Roborock errors) releases the channel before propagating.

Net effect: a retry after a transient first-attempt failure now re-subscribes cleanly instead of tripping the stale-callback guard. A genuinely persistent failure becomes a normal backoff-retry loop surfacing the real error, rather than a hard brick behind a misleading ValueError.

Tests

Added regression tests (all fail against the pre-fix code):

  • tests/devices/rpc/test_v1_channel.py
    • subscribe → unsub → subscribe no longer raises.
    • a failure during MQTT subscribe leaves _reconnect_task is None and the channel re-callable (no leaked task/subscription).
  • tests/devices/test_v1_device.py
    • connect() unsubscribes when start() fails (parametrized: RoborockException and a non-Roborock error).
    • end-to-end: start() raises a transient RoborockException once then succeeds → connect_loop() reconnects and the device becomes ready, with no ValueError.

uv run pytest — full suite passes; ruff check and mypy clean.

Copilot AI review requested due to automatic review settings June 9, 2026 12:47

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a regression where transient failures during device/channel setup could leave subscriptions partially active, preventing subsequent reconnects (e.g., “Only one subscription allowed at a time”).

Changes:

  • Make V1Channel.subscribe() teardown atomic and reusable via a shared _teardown() path.
  • Ensure RoborockDevice.connect() always unsubscribes if start() fails (including non-RoborockException errors).
  • Add regression tests covering resubscribe/retry and atomic failure cleanup.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
roborock/devices/rpc/v1_channel.py Makes subscription setup atomic and introduces _teardown() to reliably reset internal state on error/unsubscribe.
roborock/devices/device.py Broadens cleanup on start() failure to always unsubscribe before re-raising.
tests/devices/test_v1_device.py Adds regression tests ensuring connect() unsubscribes on any start() failure and can retry cleanly.
tests/devices/rpc/test_v1_channel.py Adds tests for subscribe→unsub→subscribe and for atomic cleanup after subscribe failures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread roborock/devices/device.py Outdated
Comment thread roborock/devices/rpc/v1_channel.py Outdated
Comment thread roborock/devices/rpc/v1_channel.py
Comment thread tests/devices/test_v1_device.py Outdated
@Lash-L Lash-L marked this pull request as draft June 19, 2026 11:34
@Lash-L

Lash-L commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Can you take a look at Copilots first round of reviews @pike00. I can then take a look afterwards, just mark the PR as 'ready'. Thank you!

@allenporter

Copy link
Copy Markdown
Contributor

Thank you for the fixes here.

V1Channel.subscribe() set self._callback = callback last, and the returned unsub() closure tore down the reconnect task and MQTT/local subscriptions but never cleared self._callback.
In connect(), if v1_properties.start() raised a transient RoborockException (e.g. the initial status fetch timing out for a second device contending for the MQTT session), the except RoborockException block called unsub() and re-raised — leaving self._callback still set.
connect_loop() caught the RoborockException and retried connect(). The retry's subscribe() saw self._callback is not None and raised ValueError("Only one subscription allowed at a time"). That ValueError is not a RoborockException, so it escaped to connect_loop()'s outer except Exception, which logs Uncaught error during connect and returns — permanently giving up. No ready callbacks fire, so no entities are created.

Makes sense to me. Unsetting self._callback on unsub() makes sense here. I'm not sure we need to change where callback is set though. Thanks for the fix!

A secondary leak: if subscribe() raised partway through (after starting _reconnect_task / setting _mqtt_unsub), it returned no unsub, so the reconnect task and subscriptions leaked on every retry.

How would that happen? I would think we wouldn't retry this. Notice we only deal with known RoborockException cases and not other uncaught exceptions which are considered failure. co-pilot made a comment about this catching BaseException and I agree we shouldn't catch such a broad exception here. if we're missing a new unexpeceted exception type we want to fix the bug instead.

…nonce

Address review feedback on Python-roborock#845:

- Narrow the cleanup guards in RoborockDevice.connect() and
  V1Channel.subscribe() from BaseException to Exception so
  KeyboardInterrupt/SystemExit/CancelledError propagate untouched while
  still releasing the channel on normal failures (incl. the ValueError
  and RoborockException retry paths the fix targets).
- Use an actually-16-byte test nonce to match its name and avoid
  relying on absent length validation.
@pike00

pike00 commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

On the partway leak — it's reachable through the existing retry path, not a new one. In subscribe(), after _reconnect_task is created, the MQTT subscribe re-raises RoborockException when not is_local_connected (the "both failed" branch). Pre-fix that re-raise handed back no unsub, so the running _reconnect_task was orphaned. The RoborockException propagates out of connect() to connect_loop(), which catches RoborockException and retries connect()subscribe() again → a second _reconnect_task is created, leaking the first. So each transient "local + MQTT both failed" round leaks a task on the normal backoff loop.

On where _callback is set — you're right that it isn't required for the fix. The fix is teardown clearing _callback plus wrapping setup so teardown runs on failure. Claiming it up front only does two minor things: keeps teardown symmetric (clears exactly what subscribe claims) and closes a small window where a message arriving mid-setup would be dropped while _callback was still None. Neither is necessary — it stays correct with _callback set last, since teardown clears it on the failure path regardless. Happy to move it back to the end for a minimal diff if you'd prefer.

Agreed on dropping BaseException — narrowed both sites to Exception in c429fb2.

@pike00 pike00 marked this pull request as ready for review June 20, 2026 14:26
@Lash-L Lash-L requested a review from allenporter June 20, 2026 18:18

@allenporter allenporter left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some pending review comments i forgot ot send.

Comment thread tests/devices/rpc/test_v1_channel.py Outdated
Comment thread tests/devices/test_v1_device.py Outdated
Comment thread roborock/devices/rpc/v1_channel.py
pike00 added 2 commits June 22, 2026 10:38
Rewrite the V1Channel/RoborockDevice regression tests to assert observable behavior (the callback fires on message arrival; re-subscribe and re-connect succeed) instead of private attributes (_callback, _reconnect_task, _mqtt_unsub, _unsub), per review feedback. Add type: ignore for the test-only mock assignments and FakeChannel construction that were failing the mypy pre-commit hook.

@allenporter allenporter left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you for this! Ready to merge but one last comment

self._logger.debug("MQTT connection also failed: %s", err)
raise
self._logger.debug("MQTT subscription failed, continuing with local-only connection: %s", err)
except Exception:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a style nit but: I'd like to make this clear if its expected or unexpected exceptions here by explicitly catching and forwarding both RoborockException and Exception. w/ teardown code. Essentially i'd say it like: We do expect RoborockException but this is so critical to get right to avoid leaks that we also catch Exception. I'd also like to log that it was un expected uncaught exception here. Same in Device.

That is, i'd not like it to be normal to catch Exception in this code base. We can violate the rules here given this case is important to not leak, but it's a "shouldn't happen".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5f25def. Both V1Channel.subscribe() and RoborockDevice.connect() now catch RoborockException (expected) and Exception (unexpected) as separate branches; each runs the teardown/unsub and re-raises, so behavior is unchanged.

The except Exception branch logs at exception level ("Unexpected error during subscribe/connect; ... to avoid a leak") and carries a comment making clear it is a deliberate, signposted exception to the usual "don't catch Exception" rule, kept only because a leaked subscription or reconnect task would brick the device. The expected RoborockException path stays quiet (the transient cause is already debug-logged at the point it happens).

…connect cleanup

Split the leak-safety catch-all in V1Channel.subscribe() and RoborockDevice.connect() into an explicit `except RoborockException` (the expected failure path) and an `except Exception` that logs the unexpected, shouldn't-happen error before tearing down. Both branches still run teardown/unsub and re-raise, so behavior is unchanged; the broad catch is now a deliberate, signposted exception to the usual don't-catch-Exception rule, and unexpected errors become observable in logs.
@allenporter allenporter merged commit 03193d7 into Python-roborock:main Jun 22, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants