Skip to content

RR1-T102 Fixes and hardening and systemd support for cli#16

Merged
ucswift merged 2 commits into
masterfrom
develop
Jun 28, 2026
Merged

RR1-T102 Fixes and hardening and systemd support for cli#16
ucswift merged 2 commits into
masterfrom
develop

Conversation

@ucswift

@ucswift ucswift commented Jun 26, 2026

Copy link
Copy Markdown
Member

Summary

This PR hardens the relay's long-lived LiveKit voice modes (radio, record, dispatch) and adds proper systemd support for the CLI.

Graceful shutdown under systemd / Docker

  • Adds SIGTERM handling in the console host, so systemctl stop and docker stop trigger the same clean teardown path as Ctrl+C instead of killing the process mid-work.

Resilience layer for LiveKit voice modes

  • Introduces a retry-with-exponential-backoff loop and a circuit breaker in RelayServiceBase that wraps the run for LiveKit modes. Faulted runs are automatically restarted after a jittered, capped delay; after a configurable number of consecutive quick failures the circuit opens and the service faults.
  • A run lasting longer than the "healthy run" threshold resets the failure counter, so only rapid repeated failures trip the breaker.
  • Backoff and circuit-breaker thresholds are configurable via new ResilienceOptions.

Hard disconnect detection and reconnection

  • All three voice modes now listen for LiveKit connection state changes. A transient SDK "reconnecting" only degrades the status; a hard disconnect throws so the resilience layer rejoins the room.
  • Recorder and radio modes use linked cancellation tokens / task completion sources so a hard disconnect promptly unwinds the loop and triggers reconnect.

Sentry telemetry for mode lifecycle

  • Adds a new telemetry abstraction (IRelayModeTelemetry) with Sentry-backed and no-op implementations. Mode start, retry, fault, and stop events are reported to Sentry when a DSN is configured.

Tests

  • Adds unit tests covering backoff calculation, circuit-breaker tripping, transient-failure recovery, and the disabled-resilience path.

@Resgrid-Bot

This comment has been minimized.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@ucswift, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 32 minutes and 40 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 08b7e6cb-fbfd-415f-92d5-27f4779703c0

📥 Commits

Reviewing files that changed from the base of the PR and between 7bb8622 and c006395.

📒 Files selected for processing (5)
  • Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Telemetry/SentryRelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs
  • Resgrid.Relay.Engine/Voice/RadioMode.cs
  • Resgrid.Relay.Engine/Voice/RecordMode.cs
📝 Walkthrough

Walkthrough

Adds LiveKit-mode retry/backoff and telemetry plumbing to relay services, hard-disconnect handling in the voice modes, new resilience tests, and Linux systemd deployment updates with SIGTERM-driven console shutdown.

Changes

LiveKit resilience and shutdown

Layer / File(s) Summary
Resilience contracts
Resgrid.Relay.Engine/Configuration/RelayHostOptions.cs, Resgrid.Relay.Engine/Telemetry/IRelayModeTelemetry.cs, Resgrid.Relay.Engine/Telemetry/NullRelayModeTelemetry.cs, Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs
RelayHostOptions gains Resilience; the relay mode telemetry contract, no-op implementation, and DSN-based factory are added.
Sentry telemetry
Resgrid.Relay.Engine/Telemetry/SentryRelayModeTelemetry.cs
SentryRelayModeTelemetry initializes Sentry from the provided options and records mode lifecycle, retry, fault, and disposal events.
Relay service resilience
Resgrid.Relay.Engine/Services/RelayServiceBase.cs, Resgrid.Relay.Engine/Services/DispatchRelayService.cs, Resgrid.Relay.Engine/Services/RadioRelayService.cs, Resgrid.Relay.Engine/Services/RecordRelayService.cs
RelayServiceBase now selects telemetry by mode, wraps ExecuteAsync in retry/backoff/circuit-breaker logic, and the relay services now report LiveKit mode explicitly.
LiveKit disconnect handling
Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs, Resgrid.Relay.Engine/Voice/RadioMode.cs, Resgrid.Relay.Engine/Voice/RecordMode.cs
The three voice modes now watch session.ConnectionChanged, treat reconnecting as degraded, and throw on non-reconnecting disconnects so the resilience layer can restart them.
Test coverage and internals
Resgrid.Relay.Engine/Properties/AssemblyInfo.cs, Resgrid.Audio.Voice.Tests/Resgrid.Audio.Voice.Tests.csproj, Resgrid.Audio.Voice.Tests/RelayResilienceTests.cs
The engine is exposed to the test assembly, the test project references it, and new NUnit coverage exercises backoff and LiveKit resilience behavior.
Systemd shutdown support
Resgrid.Audio.Relay.Console/Program.cs, deploy/systemd/README.md, deploy/systemd/relay.env.example, deploy/systemd/resgrid-relay.service
The console app now cancels on SIGTERM, and the systemd README, environment example, and unit file are populated for Linux deployment.

Sequence Diagram(s)

sequenceDiagram
  participant RelayServiceBase
  participant IRelayModeTelemetry
  participant DispatchVoiceMode
  participant session
  RelayServiceBase->>IRelayModeTelemetry: ModeStarting(mode)
  RelayServiceBase->>DispatchVoiceMode: ExecuteAsync(cancellationToken)
  session-->>DispatchVoiceMode: ConnectionChanged(reconnecting)
  DispatchVoiceMode->>DispatchVoiceMode: status.LiveKit = Degraded
  session-->>DispatchVoiceMode: ConnectionChanged(hard disconnect)
  DispatchVoiceMode-->>RelayServiceBase: InvalidOperationException
  RelayServiceBase->>IRelayModeTelemetry: ModeRetrying(mode, ex, attempt, nextDelay)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Resgrid/Relay#13: Modifies the same LiveKit voice mode entrypoints, and this PR extends those flows with hard-disconnect handling and resilience retries.

Poem

I hop through retries, soft and bright,
with SIGTERM moon and Sentry light.
If LiveKit blinks, I leap and mend,
then roll my carrots to the end. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title matches the main changes: CLI hardening, resilience fixes, and added systemd support.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch develop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
Resgrid.Relay.Engine/Services/RelayServiceBase.cs (1)

28-31: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Virtual member IsLiveKitMode is invoked from the base constructor.

Reading IsLiveKitMode (Line 30) during base construction runs the derived override before the derived constructor body executes. It's safe today since all overrides return a constant, but it's a fragile pattern (CA2214) if an override later depends on instance state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Resgrid.Relay.Engine/Services/RelayServiceBase.cs` around lines 28 - 31, The
base constructor in RelayServiceBase is calling the virtual IsLiveKitMode
property, which can invoke derived overrides before the derived constructor has
finished. Remove the virtual dispatch from the constructor by moving the
decision into a non-virtual initializer, a constructor parameter, or a
static/helper method used by RelayServiceBase, so the telemetry selection no
longer depends on override behavior during construction.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs`:
- Around line 13-16: The Create method in RelayModeTelemetry currently returns a
new SentryRelayModeTelemetry per service, which causes each instance to own
SentrySdk.Init and disposal. Refactor SentryRelayModeTelemetry so Sentry SDK
initialization, flush, and disposal are managed by a single shared process-wide
instance instead of per-service ownership, and update Create to reuse that
shared telemetry instance for all LiveKit services.

In `@Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs`:
- Around line 56-72: `DispatchVoiceMode` leaves `status.LiveKit` stuck at
Degraded after a transient reconnect because the `onConnectionChanged` handler
returns immediately when `change.Connected` is true. Update the
`ConnectionChanged` logic so the same handler that sets Degraded on the
"reconnecting" reason also restores `status.LiveKit` to Connected once the
session reports a healthy connected state, using the existing `disconnect`,
`status`, and `onConnectionChanged` flow.

In `@Resgrid.Relay.Engine/Voice/RadioMode.cs`:
- Around line 86-99: The connection-state handler in RadioMode’s
onConnectionChanged leaves status.LiveKit stuck at Degraded after an
auto-reconnect because the change.Connected path returns early. Update the
reconnect flow so that when the SDK reports a successful reconnect, the status
is restored to Connected (or equivalent recovery state) before returning, and
keep the degraded transition only for the reconnecting case; use the
onConnectionChanged handler and the status.LiveKit updates as the key
touchpoints.

In `@Resgrid.Relay.Engine/Voice/RecordMode.cs`:
- Around line 55-68: `RecordMode` leaves `status.LiveKit` stuck in `Degraded`
after an SDK reconnect because the `VoiceConnectionStateChange` handler returns
immediately on `change.Connected == true` and never restores the state. Update
the reconnect recovery path in the `handler` inside `RecordMode` so that when
the connection becomes connected again it sets `status.LiveKit` back to
`ConnectionState.Connected` before returning, while still preserving the
existing `reconnecting` degraded handling and hard-disconnect restart behavior.
- Around line 84-86: The disconnect path in RecordMode should not throw during a
concurrent shutdown; mirror the RadioMode cancellation guard. In the logic that
awaits Task.WhenAny for VoiceModeRuntime.WaitForCancellationAsync and
disconnect.Task, only throw the InvalidOperationException when
cancellationToken.IsCancellationRequested is false, so a Ctrl+C/SIGTERM shutdown
can complete cleanly without a spurious fault.

---

Nitpick comments:
In `@Resgrid.Relay.Engine/Services/RelayServiceBase.cs`:
- Around line 28-31: The base constructor in RelayServiceBase is calling the
virtual IsLiveKitMode property, which can invoke derived overrides before the
derived constructor has finished. Remove the virtual dispatch from the
constructor by moving the decision into a non-virtual initializer, a constructor
parameter, or a static/helper method used by RelayServiceBase, so the telemetry
selection no longer depends on override behavior during construction.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 802dd7a2-6b94-4576-a289-3fe7aede0078

📥 Commits

Reviewing files that changed from the base of the PR and between 4edaa95 and 7bb8622.

📒 Files selected for processing (19)
  • Resgrid.Audio.Relay.Console/Program.cs
  • Resgrid.Audio.Voice.Tests/RelayResilienceTests.cs
  • Resgrid.Audio.Voice.Tests/Resgrid.Audio.Voice.Tests.csproj
  • Resgrid.Relay.Engine/Configuration/RelayHostOptions.cs
  • Resgrid.Relay.Engine/Properties/AssemblyInfo.cs
  • Resgrid.Relay.Engine/Services/DispatchRelayService.cs
  • Resgrid.Relay.Engine/Services/RadioRelayService.cs
  • Resgrid.Relay.Engine/Services/RecordRelayService.cs
  • Resgrid.Relay.Engine/Services/RelayServiceBase.cs
  • Resgrid.Relay.Engine/Telemetry/IRelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Telemetry/NullRelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Telemetry/SentryRelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs
  • Resgrid.Relay.Engine/Voice/RadioMode.cs
  • Resgrid.Relay.Engine/Voice/RecordMode.cs
  • deploy/systemd/README.md
  • deploy/systemd/relay.env.example
  • deploy/systemd/resgrid-relay.service

Comment thread Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs Outdated
Comment thread Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs
Comment thread Resgrid.Relay.Engine/Voice/RadioMode.cs
Comment thread Resgrid.Relay.Engine/Voice/RecordMode.cs
Comment thread Resgrid.Relay.Engine/Voice/RecordMode.cs
Comment on lines +93 to +97
using var sigTermRegistration = PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
{
context.Cancel = true;
cancellationTokenSource.Cancel();
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kody code-review Bug critical

PlatformNotSupportedException thrown at startup on Windows because PosixSignalRegistration.Create(PosixSignal.SIGTERM) runs without a platform guard and sits outside the try block in Main, causing an unhandled exception since the Engine targets net10.0-windows. Guard the registration with OperatingSystem.IsLinux() or OperatingSystem.IsMacOS() so the SIGTERM handler only registers on supported platforms.

PosixSignalRegistration sigTermRegistration = null;
if (OperatingSystem.IsLinux() || OperatingSystem.IsMacOS())
{
	sigTermRegistration = PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
	{
		context.Cancel = true;
		cancellationTokenSource.Cancel();
	});
}
Prompt for LLM

File Resgrid.Audio.Relay.Console/Program.cs:

Line 93 to 97:

WHAT: PosixSignalRegistration.Create(PosixSignal.SIGTERM) is called without a platform guard and sits outside any try/catch, but SIGTERM is not a supported PosixSignal on Windows — .NET throws PlatformNotSupportedException. WHY: The Engine targets net10.0-windows, so running the CLI on Windows now crashes at startup (the throw is before the try block, so Main receives an unhandled exception). HOW: Guard the registration with OperatingSystem.IsLinux() (or wrap in try/catch PlatformNotSupportedException) so the SIGTERM handler only registers on platforms that support it.

Suggested Code:

PosixSignalRegistration sigTermRegistration = null;
if (OperatingSystem.IsLinux() || OperatingSystem.IsMacOS())
{
	sigTermRegistration = PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
	{
		context.Cancel = true;
		cancellationTokenSource.Cancel();
	});
}

Talk to Kody by mentioning @kody

Was this suggestion helpful? React with 👍 or 👎 to help Kody learn from this interaction.

Comment thread Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs
@Resgrid-Bot

Resgrid-Bot commented Jun 27, 2026

Copy link
Copy Markdown

Kody Review Complete

Great news! 🎉
No issues were found that match your current review configurations.

Keep up the excellent work! 🚀

Kody Guide: Usage and Configuration
Interacting with Kody
  • Request a Review: Ask Kody to review your PR manually by adding a comment with the @kody start-review command at the root of your PR.

  • Validate Business Logic: Ask Kody to validate your code against business rules by adding a comment with the @kody -v business-logic command.

  • Provide Feedback: Help Kody learn and improve by reacting to its comments with a 👍 for helpful suggestions or a 👎 if improvements are needed.

Current Kody Configuration
Review Options

The following review options are enabled or disabled:

Options Enabled
Bug
Performance
Security
Business Logic

Access your configuration settings here.

@ucswift ucswift merged commit aacf19f into master Jun 28, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants