Add EnvironmentValidator TSG: Test System Drive Free Space #302

Merged: 1008covingtonlane merged 5 commits into Azure:main from 1008covingtonlane:tsg-hardware-systemdrive-free-space on Jun 24, 2026

@1008covingtonlane

Collaborator

What

Adds a troubleshooting guide for the AzStackHci_Hardware_Test_SystemDrive_Free_Space Environment Validator check, which fails when an Azure Local machine's system drive (C:) drops below the required 30 GB of free space. The check is Critical and blocks deployment and update operations until resolved, and there was no dedicated TSG for it in the EnvironmentValidator folder.

What is covered

  • Where the failure appears: the Azure portal update readiness view (Azure Update Manager), the AzStackHciEnvironmentChecker event log (Event ID 17205), and the HealthCheckResult.EnvironmentChecker.*.json result file on the infrastructure share. Includes the validator result JSON and how to read the Detail line.
  • Confirm which machine is low and find what is consuming the drive with read-only discovery commands.
  • Tiered, production-safety-labeled reclamation: safe Microsoft-supported cleanup first (WinSxS component cleanup, Windows Update cache, crash dumps, temp), diagnostic logs with care, and platform-managed areas left alone with root-cause guidance. Each step notes whether it is safe while fully in production and which can run for several minutes.
  • Re-validation two ways: the fast single-validator path (Invoke-AzStackHciHardwareValidation -Include Test-SystemDriveFreeSpace -PassThru) and the authoritative Invoke-SolutionUpdatePrecheck.
  • Cross-links the existing "High Disk Space Usage in TEMP" known issue.

All PowerShell examples were tested on a live Azure Local cluster. The file is also added to the EnvironmentValidator README index.

New troubleshooting guide for the AzStackHci_Hardware_Test_SystemDrive_Free_Space
environment validator check, which fails when an Azure Local machine's system
drive (C:) drops below the 30 GB minimum. Covers where the failure surfaces (the
Azure portal update readiness view, the AzStackHciEnvironmentChecker event log
EventID 17205, and the HealthCheckResult JSON on the infrastructure share),
tiered space reclamation with production-safety labels, and re-validation via the
Environment Checker module and the pre-update health check. Adds the file to the
EnvironmentValidator README index.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 23, 2026 12:26

Copilot AI left a comment

Contributor

Pull request overview

Adds a new Environment Validator troubleshooting guide (TSG) for the AzStackHci_Hardware_Test_SystemDrive_Free_Space check (30 GB free space requirement on C:), and indexes it in the EnvironmentValidator README so it’s discoverable alongside existing validators/known-issues documentation.

Changes:

  • Introduces a new TSG documenting symptoms, where to find the failure output (portal + on-box), discovery, tiered remediation, and re-validation steps.
  • Adds the new TSG link to TSG/EnvironmentValidator/README.md.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md | New end-to-end TSG for diagnosing and remediating low C: free space that blocks Azure Local deployment/updates. |
| TSG/EnvironmentValidator/README.md | Adds an index entry pointing to the new TSG. |

Comment thread TSG/EnvironmentValidator/README.md Outdated
1008covingtonlane and others added 2 commits June 23, 2026 08:36
…ty link

- Rename to Troubleshooting-Test-SystemDrive-Free-Space.md to match the
  EnvironmentValidator sibling naming (Troubleshooting-Test-<Name>) and drop the
  outlier Hardware- segment; update the README index link text and path.
- Add the Azure Local low-capacity requirements link
  (https://aka.ms/azurelocallowcapacityrequirements) to Related.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…holder

- Wrap the Windows Update cache clear in try/finally with -ErrorAction Stop on the
  service stop, so wuauserv/bits are always restarted even if Remove-Item fails or
  the session is interrupted.
- Replace the concrete D:\logbackup export path with a placeholder, since a node
  may not have a D: volume; reinforces that the destination must be a non-C: volume
  or a network share.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@1008covingtonlane

Collaborator Author

Thanks for the review. Addressed in 9ac32c9:

  1. WU-cache cleanup robustness (Clear the Windows Update download cache): wrapped the clear in try/finally with -ErrorAction Stop on the service stop, so wuauserv/bits are always restarted even if Remove-Item fails or the session is interrupted.

  2. Hard-coded D:\logbackup export path: replaced with a <NON_C_DRIVE_OR_SHARE> placeholder plus an example, reinforcing that the destination must be a non-C: volume or a network share (a node may not have a D:).

  3. README "Troubleshoot" vs "Troubleshooting": already aligned to the sibling convention in commit cedacb7; the current index entry reads "Troubleshooting Test System Drive Free Space".
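
The try/finally shape described in item 1 can be sketched as follows. This is a minimal illustration, not the TSG's exact block: the service names and the cache path are the ones named in this thread, and everything else (variable names, the choice to stop on a failed delete) is an assumption. It must run elevated on the affected machine.

```powershell
# Sketch only, assuming an elevated session on the affected machine.
$services = 'wuauserv', 'bits'
$cache    = 'C:\Windows\SoftwareDistribution\Download\*'

try {
    # -ErrorAction Stop makes a failed service stop a terminating error,
    # so the cache is not touched while a service may still hold files open.
    Stop-Service -Name $services -ErrorAction Stop

    # Assumption: treat a failed delete as terminating too, so it is visible.
    Remove-Item -Path $cache -Recurse -Force -ErrorAction Stop
}
finally {
    # Runs whether the try block succeeded, threw, or was interrupted,
    # so wuauserv and bits are started again in every case.
    Start-Service -Name $services
}
```

Because there is no catch block, any error still surfaces to the caller after the services have been restarted.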

The updated Windows Update cache mitigation block was re-validated end-to-end on a live Azure Local cluster: the drive was driven below 30 GB, the TSG's own clear reclaimed the space (about 37 GB), and the check returned to SUCCESS.

@AlBurns-MSFT

Collaborator

Validated this TSG end-to-end on a live Azure Local lab cluster (build 10.2607) — the remediation works, and I have one accuracy issue to flag in the diagnostic section.

I reproduced the failure and recovery exactly as written: filled the system drive below the threshold → the check reported FAILURE with the documented detail line (…free space on root folder path 'C:' NN GB. Expected at least 30 GB.) → ran the Tier 1 "Clear the Windows Update download cache" block verbatim (including the new try/finally) → re-validated with the step 4 command → SUCCESS. The fast verify command (Invoke-AzStackHciHardwareValidation -Include Test-SystemDriveFreeSpace -PassThru) works standalone, no extra parameters needed.
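
For scripting against that detail line, a small parser can help. The sketch below is a hypothetical helper, not part of the AzStackHci.EnvironmentChecker module; it assumes only the tail of the line quoted above (`path 'C:' NN GB. Expected at least 30 GB.`) and returns nothing if the wording differs.

```powershell
# Hypothetical helper: extract reported and required free space from a Detail line.
function Get-FreeSpaceFromDetail {
    param(
        [Parameter(Mandatory)]
        [string]$Detail
    )

    # Matches the tail "path 'C:' 25 GB. Expected at least 30 GB"
    $pattern = "path '(?<Drive>[A-Za-z]:)' (?<FreeGB>\d+(\.\d+)?) GB\. Expected at least (?<RequiredGB>\d+(\.\d+)?) GB"

    if ($Detail -match $pattern) {
        $free     = [double]$Matches['FreeGB']
        $required = [double]$Matches['RequiredGB']
        [pscustomobject]@{
            Drive       = $Matches['Drive']
            FreeGB      = $free
            RequiredGB  = $required
            # How much space still has to be reclaimed (never negative)
            ShortfallGB = [math]::Max(0, $required - $free)
        }
    }
}

# Example input modeled on the detail line reported in this thread
$sample = "... free space on root folder path 'C:' 25 GB. Expected at least 30 GB."
Get-FreeSpaceFromDetail -Detail $sample
```

The shortfall figure gives a rough target for how much the reclamation tiers need to recover before re-validating.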

I also exercised the multi-node guidance on a four-node cluster:

  • The cluster-wide check in step 1 (Invoke-Command -ComputerName (Get-ClusterNode).Name { … }) returns the per-machine free space, sorted, as shown.
  • The maintenance path in step 3 (Suspend-ClusterNode -Name <node> -Drain → cleanup → Resume-ClusterNode) drained a node to Paused and brought it back to Up with the cluster healthy throughout.
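
The two multi-node steps above can be sketched roughly as follows. The body of the `Invoke-Command` script block is elided in this thread, so the one shown here (using `Get-Volume`) is an assumption rather than the TSG's own code, and `<node>` is a placeholder that must be replaced before running the maintenance part.

```powershell
# Read-only: free space on C: for every cluster node, lowest first.
# Assumption: the script block body is illustrative, not the TSG's.
Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock {
    $volume = Get-Volume -DriveLetter C
    [pscustomobject]@{
        Machine = $env:COMPUTERNAME
        FreeGB  = [math]::Round($volume.SizeRemaining / 1GB, 2)
    }
} | Sort-Object FreeGB | Select-Object Machine, FreeGB

# Maintenance path for a single node. This drains workloads off the node,
# so run it deliberately, one node at a time.
$node = '<node>'   # placeholder: set to the machine that is low on space
Suspend-ClusterNode -Name $node -Drain

# ... run the cleanup steps on $node here ...

Resume-ClusterNode -Name $node
Get-ClusterNode -Name $node | Select-Object Name, State   # expect State = Up
```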

The Tier 3 "do not delete" call-out is accurate — on a real node the large platform-managed folders it names (e.g. GMACache, NugetStore, CloudContent, Agents) are exactly what a top-level scan surfaces, so steering people away from them is right.

One issue — the names in the "On the machine" section don't match current builds. On build 10.2607, the result is emitted under:

  • Name = AzStackHci_Hardware_SystemDriveFreeSpace (not AzStackHci_Hardware_Test_SystemDrive_Free_Space)
  • Title / DisplayName = System Drive Free Space (not Test System Drive Free Space)

Two consequences a reader will hit:

  1. The documented Event Log command filters on -match 'AzStackHci_Hardware_Test_SystemDrive_Free_Space', which returns nothing on this build (that string isn't present in the event body). Filtering on AzStackHci_Hardware_SystemDriveFreeSpace returns the record.
  2. The example JSON's Name/DisplayName/Title, and the portal step that says the check "appears as Test System Drive Free Space", won't match what the reader sees — the portal row reads System Drive Free Space.

Caveat: this may be build drift — an earlier build may have emitted the longer …_Test_… form. Worth confirming which build the doc targets and aligning the example JSON and the Get-WinEvent filter to the name the product actually emits, so the on-box diagnostic step works as written. (The -Include Test-SystemDriveFreeSpace token in the verify command is correct and unaffected.)

Net: accurate and remediation-precise; just the diagnostic-section names need a refresh for current builds.

…tion

Per AlBurns-MSFT's live validation on build 10.2607, the EventID 17205 body
and the portal readiness row surface the telemetry/health-scanner name
(AzStackHci_Hardware_SystemDriveFreeSpace / "System Drive Free Space"),
while earlier builds emit the env-checker name (the longer "..._Test_..."
form). Make the on-box diagnostics work on any build:

- Get-WinEvent filter now matches either name (regex alternation), so it
  returns the record regardless of build.
- Example JSON Name/DisplayName/Title updated to the current-build form, with
  a [!NOTE] documenting the earlier form and that the Detail line is identical.
- Portal "appears as" text notes both the current and earlier display names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@1008covingtonlane

Collaborator Author

Thanks for the thorough live validation, Alex, and the build-drift catch. I confirmed it both ways: an earlier build emits the env-checker name in the event body (AzStackHci_Hardware_Test_SystemDrive_Free_Space), and your build 10.2607 emits the telemetry/health-scanner name (AzStackHci_Hardware_SystemDriveFreeSpace / System Drive Free Space), so this is a rename across builds rather than an error in either.

Pushed a build-robust fix (3de8ef8) so the diagnostic section works on any build:

  • The Get-WinEvent filter now matches either name (AzStackHci_Hardware_(Test_SystemDrive_Free_Space|SystemDriveFreeSpace)), so it returns the record on 10.2607 and on earlier builds.
  • The example JSON Name/DisplayName/Title now show the current-build form you observed, with a [!NOTE] documenting the earlier form and that the Detail line is identical on both.
  • The portal "appears as" step now lists both display names.
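
A minimal sketch of the match-either filter, using the alternation pattern quoted above. The function name is hypothetical, and it assumes that `AzStackHciEnvironmentChecker` is usable as the `-LogName` value and that the check name appears in the rendered event message; the TSG's own command may differ.

```powershell
# Alternation pattern from this thread: matches the earlier and the current name.
$pattern = 'AzStackHci_Hardware_(Test_SystemDrive_Free_Space|SystemDriveFreeSpace)'

# Both build variants match; an unrelated check name does not.
'AzStackHci_Hardware_Test_SystemDrive_Free_Space' -match $pattern   # True
'AzStackHci_Hardware_SystemDriveFreeSpace' -match $pattern          # True
'AzStackHci_Hardware_Test_Something_Else' -match $pattern           # False

# Hypothetical wrapper (Windows only): recent Event ID 17205 records for this check.
function Get-SystemDriveFreeSpaceEvent {
    param([int]$Newest = 5)

    Get-WinEvent -FilterHashtable @{ LogName = 'AzStackHciEnvironmentChecker'; Id = 17205 } |
        Where-Object { $_.Message -match $pattern } |
        Select-Object -First $Newest -Property TimeCreated, Id, Message
}
```

Calling `Get-SystemDriveFreeSpaceEvent` on an affected machine should then return the record on either build, since the filter does not depend on which name the build emits.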

The -Include Test-SystemDriveFreeSpace token in the verify command is unaffected, as you noted. Let me know if the refreshed section matches what you see on 10.2607.

…space

Add a callout to "Verify the fix" clarifying that the portal readiness view
and the HealthCheckResult.EnvironmentChecker JSON reflect the LAST health
check, so they can keep showing the failure until refreshed (by the
pre-update health check or the next scheduled periodic health check). The
fast targeted validator reflects live free space immediately, so it is the
right way to confirm a fix without waiting on the portal.

Learning from a live detect -> mitigate -> revalidate validation of this TSG
on a lab cluster, where the targeted check returned SUCCESS immediately while
the cluster-wide HealthCheckResult JSON was still ~20h stale.
@1008covingtonlane

Collaborator Author

Live end-to-end validation: grade A

I validated this TSG end to end on a live 2-node Azure Local lab cluster using an automated detect -> mitigate -> revalidate harness driven by this PR's own published guidance. Summary: the documented bad state trips the real Environment Validator check, the failure is discoverable exactly where the TSG says, the TSG's own mitigation reclaims the space, and the check returns to SUCCESS.

| phase | result | detail |
| --- | --- | --- |
| baseline | PASS | Invoke-AzStackHciHardwareValidation -Include Test-SystemDriveFreeSpace green (C: 62 GB) |
| inject | done | drove C: free 61.62 -> 24.62 GB (a 37 GB consumer in the WU download cache; an absolute 12 GB floor was respected) |
| detect | FAIL | the real check reported ... free space on root folder path 'C:' 25 GB. Expected at least 30 GB. |
| discoverability | confirmed | a fresh Event ID 17205 (status 1) was written to the AzStackHciEnvironmentChecker log, matching the "Where this failure appears" section |
| mitigate | done | ran the TSG's own Tier-1 step (stop wuauserv/bits, clear C:\Windows\SoftwareDistribution\Download\*, restart) |
| revalidate | PASS | the real check returned SUCCESS (C: back to 62 GB) |

Notes fed back into the TSG (latest commit): during the run the fast targeted validator returned SUCCESS immediately, while the cluster-wide HealthCheckResult.EnvironmentChecker.*.json (the portal source) was still ~20h stale. I added a callout to "Verify the fix" so customers know the portal reflects the last health check and can show a stale failure right after a fix, and that the targeted check is the immediate confirmation.

The TSG's other reclamation tiers (DISM component cleanup, dump/log removal, etc.) are documented for other consumers and were not exercised by this single disk-pressure injection.

INTERNAL note: validated on lab cluster b88rb1605; this comment contains no customer data.

@AlBurns-MSFT left a comment

Collaborator

LGTM — thanks for the quick turnaround on the build-drift fix. Verified the refreshed diagnostic section: both check-name forms (AzStackHci_Hardware_Test_SystemDrive_Free_Space and AzStackHci_Hardware_SystemDriveFreeSpace) are present in the shipping AzStackHci.EnvironmentChecker module, so the match-either Get-WinEvent filter is correct, and the cmdlets all verify. Approving.

@1008covingtonlane merged commit cd6a6ea into Azure:main on Jun 24, 2026
1 check passed
@1008covingtonlane deleted the tsg-hardware-systemdrive-free-space branch on June 24, 2026 16:19