From 3b56dd9adb2f07d13d67de1fbb7f3e19124bb7c3 Mon Sep 17 00:00:00 2001 From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com> Date: Tue, 23 Jun 2026 08:22:11 -0400 Subject: [PATCH 1/5] Add EnvironmentValidator TSG: Test System Drive Free Space New troubleshooting guide for the AzStackHci_Hardware_Test_SystemDrive_Free_Space environment validator check, which fails when an Azure Local machine's system drive (C:) drops below the 30 GB minimum. Covers where the failure surfaces (the Azure portal update readiness view, the AzStackHciEnvironmentChecker event log EventID 17205, and the HealthCheckResult JSON on the infrastructure share), tiered space reclamation with production-safety labels, and re-validation via the Environment Checker module and the pre-update health check. Adds the file to the EnvironmentValidator README index. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- TSG/EnvironmentValidator/README.md | 1 + ...ot-Hardware-Test-SystemDrive-Free-Space.md | 373 ++++++++++++++++++ 2 files changed, 374 insertions(+) create mode 100644 TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md diff --git a/TSG/EnvironmentValidator/README.md b/TSG/EnvironmentValidator/README.md index 72d594c..438859a 100644 --- a/TSG/EnvironmentValidator/README.md +++ b/TSG/EnvironmentValidator/README.md @@ -5,6 +5,7 @@ This folder contains the TSG's related to Environment Validators. * [Troubleshooting External Connectivity Failures in Environment Checker](./Troubleshooting-External-Connectivity-Failures-in-Environment-Checker.md) * [Troubleshooting Test NetAdapter API Failure](./Troubleshooting-Test-NetAdapter-API.md) * [Troubleshooting Test PhysicalDisk API Failure](./Troubleshooting-Test-PhysicalDisk-API.md) +* [Troubleshoot Test System Drive Free Space](./Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md) * [Troubleshooting TestPowerShell Module Version](./Troubleshooting-Test-PowerShell-Module-Version.md) * [Troubleshooting Module Versions](Troubleshooting-Module-Versions.md) * [Troubleshooting MSI Does Not Have Access to Subscription](Troubleshooting-MSI-Does-Not-Have-Access-To-Subscription.md) diff --git a/TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md b/TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md new file mode 100644 index 0000000..ff3cbf3 --- /dev/null +++ b/TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md @@ -0,0 +1,373 @@ +# AzStackHci_Hardware_Test_SystemDrive_Free_Space + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameAzStackHci_Hardware_Test_SystemDrive_Free_Space
Telemetry / health-scanner nameAzStackHci_Hardware_SystemDriveFreeSpace (same check; this is the name used in Azure telemetry and the health-fault scanner)
Display nameTest System Drive Free Space
ComponentHardware (Environment Validator / Environment Checker)
SeverityCritical: this validator blocks deployment and update operations until the machine is back above the minimum.
Required free space30 GB on the system drive (C:) of every machine.
Applicable ScenariosDeployment, Add Node, and Update / Upgrade (pre-update health check).
Affected VersionsAzure Local, version 23H2 and later.
+ +## Overview + +This validator checks that the system drive (the `C:` drive) on each Azure Local +machine has enough free space for the platform to operate and to install updates. +It fails when free space on `C:` drops below the required minimum of **30 GB** on +any machine in the cluster. + +A low system drive is a real problem, not just a warning. While the check is +failing: + +- Solution (Azure Local) updates and upgrades are blocked at pre-update + validation, so the cluster cannot be patched. +- Adding a machine to the cluster can fail validation. +- New Arc and Kubernetes extensions may fail to deploy. +- Existing workloads keep running, but the machine cannot be lifecycle-managed + reliably, and a drive that fills to zero can destabilize the node. + +## Where this failure appears + +You can see this failure in two places, the Azure portal and the machine itself. +Both show the same underlying result. + +### In the Azure portal + +The check runs as part of the update readiness and system health checks, so it +shows up in Azure Update Manager: + +1. Go to **Azure Update Manager > Resources > Azure Local**, or open the **Azure + Local** resource and its **Updates** page. +2. In the system list, select the **Update readiness** status. A system that needs + attention shows a **Critical** or **Warning** state. +3. Review the list of readiness checks. This one appears as **Test System Drive + Free Space**. +4. Select the link under **Details**. The details pane shows the per-machine + results and a **Remediation** link (`https://aka.ms/hci-envch`). + +The portal does not show the raw JSON shown below. It renders the same result as a +row in the readiness check list, with the display name, the Critical severity, the +affected machine, and the remediation link. + +This check is reported in two scenarios, and the results can differ between them +because each uses a different version of the validation logic: + +- **System health checks**, which run once every 24 hours. +- **Update readiness checks**, which run after the update content is downloaded + and before installation. + +### On the machine + +Two on-box sources carry the result. + +**Event log (per machine).** The Environment Checker writes every check result to +the **AzStackHciEnvironmentChecker** event log, located at +`C:\Windows\System32\winevt\Logs\AzStackHciEnvironmentChecker.evtx`. Each result is +the JSON body of an **Event ID 17205** entry. To read this check's most recent +result on a machine: + +```powershell +Get-WinEvent -LogName AzStackHciEnvironmentChecker -FilterXPath '*[System[(EventID=17205)]]' -MaxEvents 2000 | + Where-Object { $_.Message -match 'AzStackHci_Hardware_Test_SystemDrive_Free_Space' } | + Select-Object -First 1 -ExpandProperty Message +``` + +**Pre-update health check result file (cluster-wide).** The pre-update health +check writes its full result set to the cluster infrastructure share: + +``` +C:\ClusterStorage\Infrastructure_1\Shares\SU1_Infrastructure_1\Updates\HealthCheck\System\HealthCheckResult.EnvironmentChecker..json +``` + +This file is on cluster storage, so it is the same from any machine in the +cluster. The newest `HealthCheckResult.EnvironmentChecker.*.json` holds the latest +run. (A separate `HealthCheckResult.CheckCloudHealth.*.json` covers other checks +and does not contain this one.) + +In both sources the result for this check looks like this: + +```json +{ + "Name": "AzStackHci_Hardware_Test_SystemDrive_Free_Space", + "DisplayName": "Test System Drive Free Space", + "Title": "Test System Drive Free Space", + "Severity": "Critical", + "Status": "FAILURE", + "Description": "Checking System Drive Free Space", + "TargetResourceType": "Disk", + "TargetResourceName": "Machine: AzL-Node-01, Class: Disk, DriveLetter: C:", + "Remediation": "https://aka.ms/hci-envch", + "AdditionalData": { + "Detail": "Checking Hostname AzL-Node-01 for free space on root folder path 'C:' 25 GB. Expected at least 30 GB.", + "Status": "FAILURE", + "Resource": "AzL-Node-01" + } +} +``` + +The `Detail` line is the key part. It names the machine (`AzL-Node-01` above), the +free space it found (25 GB), and the minimum it expected (30 GB). A passing result +has `Status` of `0` or `SUCCESS`; a failing result has a non-zero status or +`FAILURE`. + +## Requirements + +1. Each Azure Local machine must have at least **30 GB** free on its system drive + (`C:`). +2. You run the steps below on the affected machine, signed in as an administrator, + in a PowerShell session. + +## Troubleshooting Steps + +### 1. Confirm which machine is low + +Check the free space directly on each machine: + +```powershell +Get-PSDrive C | Select-Object @{n='FreeGB';e={[math]::Round($_.Free/1GB,1)}}, + @{n='UsedGB';e={[math]::Round($_.Used/1GB,1)}} +``` + +If `FreeGB` is below 30, this check will fail on that machine. To check every +machine in the cluster at once: + +```powershell +Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock { + [pscustomobject]@{ Node = $env:COMPUTERNAME + FreeGB = [math]::Round((Get-PSDrive C).Free/1GB,1) } +} | Sort-Object FreeGB +``` + +### 2. Find what is using the system drive + +Before deleting anything, see where the space went. These commands are read-only. + +```powershell +# Largest top-level folders on C: (this recursive scan can take a minute or two) +Get-ChildItem C:\ -Directory -Force -ErrorAction SilentlyContinue | ForEach-Object { + $b = (Get-ChildItem $_.FullName -Recurse -File -Force -ErrorAction SilentlyContinue | + Measure-Object Length -Sum).Sum + [pscustomobject]@{ Folder = $_.Name; GB = [math]::Round($b/1GB,2) } +} | Sort-Object GB -Descending | Select-Object -First 12 + +# How much the Windows component store (WinSxS) can reclaim +Dism.exe /Online /Cleanup-Image /AnalyzeComponentStore + +# Largest Windows event logs +Get-ChildItem C:\Windows\System32\winevt\Logs -File | Sort-Object Length -Descending | + Select-Object -First 8 Name, @{n='GB';e={[math]::Round($_.Length/1GB,2)}} +``` + +On an Azure Local machine the usual large consumers are the Windows folder +(including the WinSxS component store), the monitoring agent cache +(`C:\GMACache`), Windows event logs, and the Windows Update download cache. + +One specific cause worth ruling out is leftover Environment Checker package folders +piling up under the orchestrator's temp directory. If you see many folders there, +follow the dedicated guide: +[Known Issue: High Disk Space Usage in TEMP](./Known-Issue-High-Disk-Space-usage-in-TEMP.md). + +### 3. Reclaim space safely + +Work top to bottom. Tier 1 is safe and Microsoft-supported. Stop once the machine +is back above 30 GB free with some margin. + +**Production safety at a glance.** None of the steps below require cluster downtime +or a reboot. A few need light coordination: + +| Action | Safe while fully in production? | +| --- | --- | +| Tier 1a: WinSxS component cleanup | Yes, no reboot. It is IO and CPU intensive and can take several minutes, so prefer a quieter period. | +| Tier 1b: clear Windows Update cache | Yes, but not while a solution update or upgrade is in progress, because it briefly stops the Windows Update and BITS services. | +| Tier 1c: remove crash dumps | Yes, deletes files only. | +| Tier 1d: clear temporary files | Yes, deletes files only. | +| Tier 2: clear large event logs | Yes for uptime, but this erases diagnostic and audit history, and clearing the Security log has compliance implications. Export first. | +| Tier 3: platform-managed areas | Do not delete. Fixing the cause has no workload impact. | + +If a machine is already near zero free space and at risk of dropping out of the +cluster, treat that one machine as a maintenance action: pause and drain it first +so its workloads move to other machines, then clean up, then resume. The cluster +stays in production throughout, because the workloads live-migrate. + +```powershell +Suspend-ClusterNode -Name -Drain # move workloads off this machine +# ... run the cleanup steps below ... +Resume-ClusterNode -Name # return the machine to service +``` + +#### Tier 1: safe to reclaim now + +**a. Clean the Windows component store (WinSxS).** This removes superseded update +components and is fully supported. It is usually the largest safe win. Safe to run +while fully in production with no reboot; it is IO and CPU intensive and can take +several minutes to complete, so prefer a quieter period. A small number of packages +can need a reboot to finish, so if the analysis still reports reclaimable packages +afterward, a maintenance reboot completes the cleanup. + +```powershell +Dism.exe /Online /Cleanup-Image /StartComponentCleanup +``` + +**b. Clear the Windows Update download cache.** Safe to clear; Windows re-downloads +what it needs. Do not run this while a solution update or upgrade is in progress, +because it briefly stops the Windows Update (`wuauserv`) and BITS services. Outside +an active update there is no workload impact. + +```powershell +Stop-Service wuauserv, bits +Remove-Item 'C:\Windows\SoftwareDistribution\Download\*' -Recurse -Force -ErrorAction SilentlyContinue +Start-Service wuauserv, bits +``` + +**c. Remove crash dumps.** Collect them first only if you have an open support case +that needs them. Safe in production; this deletes files only. + +```powershell +Remove-Item C:\Windows\MEMORY.DMP -Force -ErrorAction SilentlyContinue +Remove-Item C:\Windows\Minidump\* -Force -ErrorAction SilentlyContinue +Remove-Item C:\Windows\LiveKernelReports\* -Recurse -Force -ErrorAction SilentlyContinue +Remove-Item "$env:ProgramData\Microsoft\Windows\WER\ReportQueue\*" -Recurse -Force -ErrorAction SilentlyContinue +``` + +**d. Clear temporary files.** Safe in production; this deletes files only, and +files in use are skipped. + +```powershell +Remove-Item C:\Windows\Temp\* -Recurse -Force -ErrorAction SilentlyContinue +Remove-Item $env:TEMP\* -Recurse -Force -ErrorAction SilentlyContinue +``` + +#### Tier 2: diagnostic logs (reclaim with care) + +Large event logs such as `Microsoft-Windows-FailoverClustering%4Diagnostic` and +`Security` can each be 1 GB or more. They hold troubleshooting history and they +regrow to their configured maximum size, so clearing them is a temporary gain. +Clearing a log needs no reboot or downtime, but it erases troubleshooting and audit +history, and clearing the Security log has compliance implications, so treat it as +a data-retention decision. + +If you do not need the history, clear the log directly: + +```powershell +wevtutil clear-log 'Microsoft-Windows-FailoverClustering/Diagnostic' # example +``` + +If you want to keep the history, export first, then clear. Write the export to a +volume other than `C:` or to a network share, because the export is the same size +as the log (often 1 GB or more), so writing it to `C:` would consume the very space +you are trying to reclaim. Delete the export once you confirm you no longer need it. + +```powershell +$log = 'Microsoft-Windows-FailoverClustering/Diagnostic' # example +$dest = 'D:\logbackup' # any non-C: volume or share +New-Item -ItemType Directory $dest -Force | Out-Null +wevtutil export-log $log (Join-Path $dest (($log -replace '/','_') + '.evtx')) /overwrite:true +wevtutil clear-log $log +``` + +Do not disable or permanently shrink platform diagnostic logs without guidance, +because they are needed to investigate cluster issues. + +#### Tier 3: platform-managed areas (do not delete; find the cause) + +Some large folders are managed by the platform. Deleting them can break monitoring +or updates, and it does not fix the underlying cause. + +- **`C:\GMACache` (monitoring agent cache).** A large `GMACache`, especially + `GMACache\TelemetryCache`, usually means the machine cannot upload telemetry to + Azure, so the data backs up on disk. The fix is to restore outbound connectivity + and the Arc connection so the cache drains on its own. Do not delete the cache to + free space; that loses buffered data, and the folder simply refills while + connectivity is broken. +- **`C:\Observability`, `C:\NugetStore`, `C:\ImageComposition`, `C:\CloudContent`, + `C:\Agents`.** These hold platform logs, solution packages, and update content. + They are managed and rotated automatically. Do not delete them. If one of them is + unusually large, open a support case rather than removing files. + +### 4. Verify the fix + +First confirm the machine is back above the minimum: + +```powershell +Get-PSDrive C | Select-Object @{n='FreeGB';e={[math]::Round($_.Free/1GB,1)}} +``` + +Then re-validate. You have two options. + +**Fast: run just this one validator.** The Environment Checker module ships on every +Azure Local machine, so you can run this single hardware check directly and get a +result back in a few seconds, without running the full pre-update health check: + +```powershell +$r = Invoke-AzStackHciHardwareValidation -Include Test-SystemDriveFreeSpace -PassThru +$r | Select-Object Name, Status, Severity +$r.AdditionalData.Detail +``` + +A healthy machine returns `Status` of `SUCCESS` and a detail line like +`Checking Hostname for free space on root folder path 'C:' 56 GB. Expected at least 30 GB.` +This is the quickest way to confirm your cleanup worked on the machine you just +fixed. (`-Include Test-SystemDriveFreeSpace` runs only this check; drop the +`-Include` to run the full hardware validation.) + +**Authoritative: re-run the pre-update health check.** This is what the portal +readiness view and the cluster-wide result file reflect, so run it to clear the +failure everywhere it is reported. It runs the full readiness check, so allow +several minutes for the results to refresh: + +```powershell +Invoke-SolutionUpdatePrecheck +``` + +After the re-run, **Test System Drive Free Space** should report success. You can +confirm it in any of the places listed under [Where this failure +appears](#where-this-failure-appears): the portal readiness checks, the +`AzStackHciEnvironmentChecker` event log (Event ID 17205), or the newest +`HealthCheckResult.EnvironmentChecker.*.json` on the infrastructure share. + +If it still fails, repeat step 2 to see what refilled the drive. A drive that +refills quickly is usually caused by a backed-up `GMACache` (a connectivity +problem) or a runaway log, not a one-time pile of files. + +## When to escalate + +Open a support case if any of the following are true: + +- The drive refills faster than you can reclaim it, even after you fix outbound + connectivity. +- A platform-managed folder (Tier 3) is the dominant consumer, and you cannot find + a connectivity or update cause. +- The machine is at or near zero free space and will not boot or stay in the + cluster. + +## Related + +- [Known Issue: High Disk Space Usage in TEMP](./Known-Issue-High-Disk-Space-usage-in-TEMP.md) +- General Environment Checker remediation link shown in the validator output: + https://aka.ms/hci-envch From cedacb793093821ebffa17da0e48582f26a7b122 Mon Sep 17 00:00:00 2001 From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com> Date: Tue, 23 Jun 2026 08:36:35 -0400 Subject: [PATCH 2/5] Address review nits: align filename to sibling convention, add capacity link - Rename to Troubleshooting-Test-SystemDrive-Free-Space.md to match the EnvironmentValidator sibling naming (Troubleshooting-Test-) and drop the outlier Hardware- segment; update the README index link text and path. - Add the Azure Local low-capacity requirements link (https://aka.ms/azurelocallowcapacityrequirements) to Related. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- TSG/EnvironmentValidator/README.md | 2 +- ...-Space.md => Troubleshooting-Test-SystemDrive-Free-Space.md} | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) rename TSG/EnvironmentValidator/{Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md => Troubleshooting-Test-SystemDrive-Free-Space.md} (99%) diff --git a/TSG/EnvironmentValidator/README.md b/TSG/EnvironmentValidator/README.md index 438859a..11611c1 100644 --- a/TSG/EnvironmentValidator/README.md +++ b/TSG/EnvironmentValidator/README.md @@ -5,7 +5,7 @@ This folder contains the TSG's related to Environment Validators. * [Troubleshooting External Connectivity Failures in Environment Checker](./Troubleshooting-External-Connectivity-Failures-in-Environment-Checker.md) * [Troubleshooting Test NetAdapter API Failure](./Troubleshooting-Test-NetAdapter-API.md) * [Troubleshooting Test PhysicalDisk API Failure](./Troubleshooting-Test-PhysicalDisk-API.md) -* [Troubleshoot Test System Drive Free Space](./Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md) +* [Troubleshooting Test System Drive Free Space](./Troubleshooting-Test-SystemDrive-Free-Space.md) * [Troubleshooting TestPowerShell Module Version](./Troubleshooting-Test-PowerShell-Module-Version.md) * [Troubleshooting Module Versions](Troubleshooting-Module-Versions.md) * [Troubleshooting MSI Does Not Have Access to Subscription](Troubleshooting-MSI-Does-Not-Have-Access-To-Subscription.md) diff --git a/TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md b/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md similarity index 99% rename from TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md rename to TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md index ff3cbf3..dfe0180 100644 --- a/TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md +++ b/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md @@ -371,3 +371,5 @@ Open a support case if any of the following are true: - [Known Issue: High Disk Space Usage in TEMP](./Known-Issue-High-Disk-Space-usage-in-TEMP.md) - General Environment Checker remediation link shown in the validator output: https://aka.ms/hci-envch +- Azure Local low-capacity requirements: + https://aka.ms/azurelocallowcapacityrequirements From 9ac32c979c36503fc5e137aa58dc5c4954ddf54c Mon Sep 17 00:00:00 2001 From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com> Date: Tue, 23 Jun 2026 09:29:58 -0400 Subject: [PATCH 3/5] Address reviewer feedback: robust WU-cache cleanup, export-path placeholder - Wrap the Windows Update cache clear in try/finally with -ErrorAction Stop on the service stop, so wuauserv/bits are always restarted even if Remove-Item fails or the session is interrupted. - Replace the concrete D:\logbackup export path with a placeholder, since a node may not have a D: volume; reinforces that the destination must be a non-C: volume or a network share. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../Troubleshooting-Test-SystemDrive-Free-Space.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md b/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md index dfe0180..30a06ad 100644 --- a/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md +++ b/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md @@ -240,9 +240,12 @@ because it briefly stops the Windows Update (`wuauserv`) and BITS services. Outs an active update there is no workload impact. ```powershell -Stop-Service wuauserv, bits -Remove-Item 'C:\Windows\SoftwareDistribution\Download\*' -Recurse -Force -ErrorAction SilentlyContinue -Start-Service wuauserv, bits +Stop-Service wuauserv, bits -ErrorAction Stop +try { + Remove-Item 'C:\Windows\SoftwareDistribution\Download\*' -Recurse -Force -ErrorAction SilentlyContinue +} finally { + Start-Service wuauserv, bits +} ``` **c. Remove crash dumps.** Collect them first only if you have an open support case @@ -285,7 +288,7 @@ you are trying to reclaim. Delete the export once you confirm you no longer need ```powershell $log = 'Microsoft-Windows-FailoverClustering/Diagnostic' # example -$dest = 'D:\logbackup' # any non-C: volume or share +$dest = '' # e.g. E:\logbackup or \\server\share (must not be C:) New-Item -ItemType Directory $dest -Force | Out-Null wevtutil export-log $log (Join-Path $dest (($log -replace '/','_') + '.evtx')) /overwrite:true wevtutil clear-log $log From 3de8ef836aa96276d5d5b898eb139312fbd2c1aa Mon Sep 17 00:00:00 2001 From: John Neemes Date: Wed, 24 Jun 2026 11:01:32 -0400 Subject: [PATCH 4/5] Address review: build-robust event/portal names in the diagnostic section Per AlBurns-MSFT's live validation on build 10.2607, the EventID 17205 body and the portal readiness row surface the telemetry/health-scanner name (AzStackHci_Hardware_SystemDriveFreeSpace / "System Drive Free Space"), while earlier builds emit the env-checker name (the longer "..._Test_..." form). Make the on-box diagnostics work on any build: - Get-WinEvent filter now matches either name (regex alternation), so it returns the record regardless of build. - Example JSON Name/DisplayName/Title updated to the current-build form, with a [!NOTE] documenting the earlier form and that the Detail line is identical. - Portal "appears as" text notes both the current and earlier display names. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- ...bleshooting-Test-SystemDrive-Free-Space.md | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md b/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md index 30a06ad..66d943a 100644 --- a/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md +++ b/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md @@ -66,8 +66,8 @@ shows up in Azure Update Manager: Local** resource and its **Updates** page. 2. In the system list, select the **Update readiness** status. A system that needs attention shows a **Critical** or **Warning** state. -3. Review the list of readiness checks. This one appears as **Test System Drive - Free Space**. +3. Review the list of readiness checks. On current builds this appears as **System + Drive Free Space** (earlier builds: **Test System Drive Free Space**). 4. Select the link under **Details**. The details pane shows the per-machine results and a **Remediation** link (`https://aka.ms/hci-envch`). @@ -94,7 +94,7 @@ result on a machine: ```powershell Get-WinEvent -LogName AzStackHciEnvironmentChecker -FilterXPath '*[System[(EventID=17205)]]' -MaxEvents 2000 | - Where-Object { $_.Message -match 'AzStackHci_Hardware_Test_SystemDrive_Free_Space' } | + Where-Object { $_.Message -match 'AzStackHci_Hardware_(Test_SystemDrive_Free_Space|SystemDriveFreeSpace)' } | Select-Object -First 1 -ExpandProperty Message ``` @@ -114,9 +114,9 @@ In both sources the result for this check looks like this: ```json { - "Name": "AzStackHci_Hardware_Test_SystemDrive_Free_Space", - "DisplayName": "Test System Drive Free Space", - "Title": "Test System Drive Free Space", + "Name": "AzStackHci_Hardware_SystemDriveFreeSpace", + "DisplayName": "System Drive Free Space", + "Title": "System Drive Free Space", "Severity": "Critical", "Status": "FAILURE", "Description": "Checking System Drive Free Space", @@ -131,6 +131,13 @@ In both sources the result for this check looks like this: } ``` +> [!NOTE] +> The `Name`, `DisplayName`, and `Title` vary by build. Current builds emit the +> telemetry / health-scanner name shown above (`AzStackHci_Hardware_SystemDriveFreeSpace` / +> `System Drive Free Space`); earlier builds emit the env-checker name +> (`AzStackHci_Hardware_Test_SystemDrive_Free_Space` / `Test System Drive Free Space`). The +> `Detail` line is identical on both, and the `Get-WinEvent` filter above matches either name. + The `Detail` line is the key part. It names the machine (`AzL-Node-01` above), the free space it found (25 GB), and the minimum it expected (30 GB). A passing result has `Status` of `0` or `SUCCESS`; a failing result has a non-zero status or From a5cca299874678a11bb9343e0c554c7bbce9b7b4 Mon Sep 17 00:00:00 2001 From: John Neemes Date: Wed, 24 Jun 2026 12:01:41 -0400 Subject: [PATCH 5/5] TSG: note the portal can show a stale failure right after reclaiming space Add a callout to "Verify the fix" clarifying that the portal readiness view and the HealthCheckResult.EnvironmentChecker JSON reflect the LAST health check, so they can keep showing the failure until refreshed (by the pre-update health check or the next scheduled periodic health check). The fast targeted validator reflects live free space immediately, so it is the right way to confirm a fix without waiting on the portal. Learning from a live detect -> mitigate -> revalidate validation of this TSG on a lab cluster, where the targeted check returned SUCCESS immediately while the cluster-wide HealthCheckResult JSON was still ~20h stale. --- .../Troubleshooting-Test-SystemDrive-Free-Space.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md b/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md index 66d943a..82ac1c9 100644 --- a/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md +++ b/TSG/EnvironmentValidator/Troubleshooting-Test-SystemDrive-Free-Space.md @@ -361,6 +361,14 @@ appears](#where-this-failure-appears): the portal readiness checks, the `AzStackHciEnvironmentChecker` event log (Event ID 17205), or the newest `HealthCheckResult.EnvironmentChecker.*.json` on the infrastructure share. +> **The portal can show a stale failure right after you reclaim space.** The +> portal readiness view and the `HealthCheckResult.EnvironmentChecker.*.json` file +> report the result of the *last* health check, so they keep showing the failure +> until that result is refreshed, either by the pre-update health check above or by +> the next scheduled periodic health check (roughly once a day). The fast targeted +> check reflects the machine's live free space immediately, so use it to confirm +> your fix and do not wait on the portal to update. + If it still fails, repeat step 2 to see what refilled the drive. A drive that refills quickly is usually caused by a backed-up `GMACache` (a connectivity problem) or a runaway log, not a one-time pile of files.