Skip to content

Group CLI crash reports on structured error signals#7901

Open
stephanie-shopify wants to merge 5 commits into
mainfrom
fix/error-grouping-structured
Open

Group CLI crash reports on structured error signals#7901
stephanie-shopify wants to merge 5 commits into
mainfrom
fix/error-grouping-structured

Conversation

@stephanie-shopify

@stephanie-shopify stephanie-shopify commented Jun 23, 2026

Copy link
Copy Markdown

WHY are these changes introduced?

Fixes #7891. which comes from resiliency issue

A single Bugsnag bucket (groupingHash 16864652937831232783) had become a catch-all for ~1,170 unrelated GraphQL/API errors spanning app/theme/store/hydrogen. That makes the bucket un-routable (five owning teams in one group), mixes expected/transient noise with genuine regressions, and manufactures false P1s.

The earlier fix attempt (#7894) addressed the symptom by regex-reparsing the stringified error message to reconstruct type information, and removed expected-error suppression entirely (reporting every AbortError). This PR is the redesign: it groups on the structured signals the error already carries.

The key insight: in sendErrorToBugsnag, only the reportable copy is flattened to new Error(error.message) — the original typed error is still in scope and already carries statusCode + errors[].extensions.code. So we group on facts, not on a string we just serialized.

WHAT is this pull request doing?

  • New error-grouping.ts — derives {httpStatus, code, errorClass} from the original error (GraphQLClientError and raw graphql-request ClientError) and maps them via an explicit decision table:
    • THROTTLED / 429rate_limit
    • 401authentication
    • 403 / ACCESS_DENIEDpermission
    • 5xxserver
    • else → shared keyword categorizer (fallback only)
    • unknown → leave groupingHash unset so Bugsnag's stack-trace grouping applies and distinct unknown bugs aren't merged.
  • error-handler.ts — sets event.groupingHash only when a real category resolves, and emits error_grouping metadata (slice_name, http_status, error_code, error_class) so backend dashboards/routing work regardless of CLI version (the hash is client-side; old versions keep the old hash until adoption rolls).
  • error.ts — raw ClientError with 401/429/THROTTLED is now treated as expected and kept out of crash reporting.
  • The shared error-categorizer.ts / storage.ts (Monorail analytics taxonomy) are untouched — their tests stay green, proving the error:${category}:${signature} events are unaffected.
  • A short error-grouping.DESIGN.md records the problem, constraints, and the three decisions.

How to test your changes?

pnpm vitest run packages/cli-kit/src/private/node/analytics/error-grouping.test.ts
pnpm vitest run packages/cli-kit/src/public/node/error-handler.test.ts
pnpm vitest run packages/cli-kit/src/public/node/error.test.ts
pnpm vitest run packages/cli-kit/src/private/node/analytics/   # categorizer + storage stay green

Decision-table matrix (asserted in tests): 403 ACCESS_DENIEDtheme:permission:http-403-access-denied; 401 → *:authentication:http-401; THROTTLED*:rate_limit:* and not reported; 5xx → server; raw ClientError; generic ErrorgroupingHash unset (stack grouping).

Rollout note

Setting groupingHash re-buckets all CLI errors once — a one-time grouping migration across CLI error dashboards. This is intended and an improvement, but should be flagged to whoever owns the Bugsnag/Observe dashboards before merge. The structured metadata tags give backend routing that works across all CLI versions immediately, which is the mitigation while client adoption rolls.

Follow-ups (out of scope)

  • At-source category/code on the FatalError hierarchy + API wrappers (the "proper" version of this).
  • Folding status/code grouping into error-categorizer.ts (needs a coordinated storage.ts update + pinned tests).
  • Alert on *:unknown:* and cli:* bucket growth so the catch-all surfaces itself.
  • Tie Bugsnag severity to SLO/error-budget burn rather than aggregate volume.

Checklist

  • I've considered possible cross-platform impacts (Mac, Linux, Windows) — pure TS, no platform-specific paths.
  • I've considered possible documentation changes — none needed (internal crash-reporting behavior).
  • I've considered analytics changes to measure impact — the shared Monorail taxonomy is deliberately untouched; new Bugsnag error_grouping metadata is additive.
  • The change is user-facing — patch bump added via changeset.

🤖 Generated with Claude Code

Replaces the single catch-all Bugsnag bucket (~1,170 unrelated GraphQL/API
errors across app/theme/store/hydrogen) with grouping derived from the original
typed error's HTTP status and GraphQL extensions.code, rather than regex-
reparsing a stringified message.

- New error-grouping.ts: structured-first decision table
  (THROTTLED/429->rate_limit, 401->authentication, 403/ACCESS_DENIED->permission,
  5xx->server), shared keyword categorizer as fallback, and undefined for unknown
  errors so Bugsnag stack-trace grouping is preserved.
- error-handler.ts: sets event.groupingHash only when a real category resolves,
  and emits error_grouping metadata (slice_name, http_status, error_code,
  error_class) so backend routing works regardless of CLI version.
- error.ts: raw graphql-request ClientError 401/429/THROTTLED is now treated as
  expected and kept out of crash reporting.

Shared error-categorizer.ts / storage.ts (Monorail taxonomy) are untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@stephanie-shopify stephanie-shopify requested review from a team as code owners June 23, 2026 00:48
@github-actions github-actions Bot added the Area: @shopify/cli @shopify/cli package issues label Jun 23, 2026
Knip flagged the interface as an unused export — it is only used internally
as the return type of errorGroupingSignals; callers destructure inline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@stephanie-shopify stephanie-shopify requested a review from mbarak June 24, 2026 18:55

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be included? it has language that is specific to the current issue and not necessarily for long term use.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 8e0dc34 — the design record lives in the PR description and plan, not in the shipped package.

const message = errorMessage(error)
// Object signals are authoritative; message-derived signals fill gaps (e.g. 5xx errors that the
// API layer flattens into an AbortError, dropping the structured status field).
const signals: ErrorGroupingSignals = {...signalsFromMessage(message), ...errorGroupingSignals(error)}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above in errorGroupingSignals only assign if there's a value i.e. never assign null or undefined.


🐛 Bug: The comment says message-derived signals fill gaps, but object spread copies keys even when their value is explicitly undefined. errorGroupingSignals can assign httpStatus or code as undefined; spreading those values over signalsFromMessage(message) can erase a valid status or code recovered from the message fallback. That makes the fallback less reliable for flattened API errors.

Suggestion: Consider merging field-by-field with nullish coalescing, or only assigning properties to signals when the extracted value is not undefined. For example, derive fromMessage and fromError separately, then build httpStatus, code, and errorClass as fromError.value ?? fromMessage.value.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real bug, fixed. errorGroupingSignals no longer assigns a field at all when the value is absent, and the merge in resolveErrorGrouping is now field-by-field with ?? (fromError.x ?? fromMessage.x) so an absent object signal can never clobber one recovered from the message. Pinned with a test: a GraphQLClientError whose code is only present in the embedded message JSON now keeps that code in the signals.

function categoryFromSignals(signals: ErrorGroupingSignals): string | undefined {
const {httpStatus, code} = signals

if (code === 'THROTTLED' || httpStatus === 429) return 'rate_limit'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: The structured decision table handles HTTP 429 and extensions.code === 'THROTTLED', but existing retry logic also treats extensions.code === '429' as a GraphQL rate-limit signal, even when the HTTP status is 200. A raw ClientError with that established shape can currently miss the rate_limit bucket.

Suggestion: Consider adding code === '429' to the rate-limit branch and covering the HTTP-200/code-429 shape with a regression test.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed against errorsIncludeStatus429 in private/node/api.ts:266extensions.code === '429' is the established shape. Added '429' to the rate-limit set and a regression test (clientError(200, '429') -> theme:rate_limit:http-200-429).

const {httpStatus, code} = signals

if (code === 'THROTTLED' || httpStatus === 429) return 'rate_limit'
if (httpStatus === 401) return 'authentication'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Suggestion: The documented decision table treats ACCESS_DENIED as a permission signal, but the implementation checks httpStatus === 401 before code === 'ACCESS_DENIED'. If an error carries both HTTP 401 and ACCESS_DENIED, it will be grouped as authentication rather than permission.

Suggestion: Consider moving the httpStatus === 403 || code === 'ACCESS_DENIED' branch above the 401 branch if access-denied should win. If 401 is intentionally higher precedence, add a focused test or documentation note so that routing choice is explicit.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deliberate — keeping 401 → authentication. HTTP 401 is the authoritative transport signal: a request the server rejected as unauthenticated is an authentication problem even if a GraphQL code also reports access-denied (which in practice pairs with 403, handled by the very next branch). Made the precedence explicit in a categoryFromSignals comment and pinned it with a test: a 401 + ACCESS_DENIED groups as theme:authentication:http-401-access-denied (the code is still recorded in the signature).

*/
function firstExtensionCode(errors: unknown): string | undefined {
if (!Array.isArray(errors)) return undefined
const code = (errors[0] as {extensions?: {code?: unknown}} | undefined)?.extensions?.code

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: firstExtensionCode only reads errors[0].extensions.code, but GraphQL responses can include multiple errors and existing code in this package treats those arrays as multi-entry structures. A later error may carry THROTTLED, 429, or ACCESS_DENIED, and App Management errors can carry nested extensions.app_errors.errors[].category === 'access_denied'. Missing those shapes can route known API conditions to a fallback or default bucket instead of the intended structured grouping.

Suggestion: Consider replacing the first-error lookup with a small normalization helper that scans every GraphQL error, recognizes both THROTTLED and 429 as rate-limit signals, recognizes ACCESS_DENIED and nested App Management access_denied as permission signals, and is shared with the reporting-suppression logic where possible.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Replaced the first-error lookup with a shared scanner that walks every error and recognizes THROTTLED/429 (rate-limit) and ACCESS_DENIED plus the nested App Management extensions.app_errors.errors[].category === 'access_denied' (permission). It lives in a new dependency-free graphql-error-codes.ts so it can be shared with the suppression logic in error.tserror.ts can't import error-grouping.ts directly without creating an error.ts → headers.ts → error.ts cycle. Grouping picks a routable code (rate-limit, then permission) so a benign leading error can't mask one further down the array.

if (groupingHash) {
event.groupingHash = groupingHash
}
const {httpStatus, code, errorClass} = errorGroupingSignals(error)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: event.groupingHash is derived through errorGroupingHash(error, sliceName), which can combine typed error signals, message-derived HTTP/code signals, and fallback keyword categorization. The error_grouping metadata immediately below is built from errorGroupingSignals(error) alone, so it can omit facts that the hash used, such as a message-derived http_status for flattened errors or a fallback category. It also omits the semantic category field that the design document says will be emitted, which weakens backend routing that is supposed to depend on stable metadata tags rather than just the hash.

Suggestion: Consider exposing a single helper from error-grouping.ts that returns the fully resolved grouping result, for example {hash, category, signals}, and use that result for both event.groupingHash and event.addMetadata('error_grouping', ...). That would also avoid recalculating errorGroupingSignals(error) separately for the hash and the metadata.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — emitting category was in the design and I dropped it in the wiring. Refactored to a single resolveErrorGrouping(error, slice) that returns {hash, category, signals}; the reporter now drives both event.groupingHash and the error_grouping metadata from that one result, so the metadata reflects message-derived signals and includes category. Added a test asserting a 5xx flattened into an AbortError surfaces http_status: 503 and category: 'server' in the metadata.

return false
}
const status = error.response?.status
if (status === 401 || status === 429) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is 401 transient?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair — 401 isn't transient in the retry sense. It means the user's session token is expired or invalid: a credential/environment condition we never want as a crash report (called out in #7891), distinct from rate limiting. transient was a poor umbrella. Renamed to isExpectedApiError and split the doc comment so 401 (credentials) and 429/THROTTLED (rate limit) are explained separately.

const errors = error.response?.errors
if (Array.isArray(errors)) {
const code = (errors[0] as {extensions?: {code?: unknown}} | undefined)?.extensions?.code
return code === 'THROTTLED'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: isTransientApiError suppresses HTTP 429 and first-error THROTTLED, but it misses the established GraphQL extensions.code === '429' rate-limit shape and any transient code that appears after the first error entry. After retries are exhausted, those raw ClientErrors could still be treated as unexpected and sent to Bugsnag even though the retry layer classifies them as transient rate limiting.

Suggestion: Consider scanning all entries in error.response.errors and treating both THROTTLED and 429 as transient. A regression test like shouldReportErrorAsUnexpected(clientError(200, '429')) === false would keep the suppression logic aligned with the retry-layer fixture shape.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed via the same shared scanner as the grouping comment — isExpectedApiError now calls hasRateLimitCode(error.response?.errors), which scans every error for THROTTLED/429. Added regression tests: shouldReportErrorAsUnexpected(clientError(200, '429')) === false, and a rate-limit code on a non-first error entry is also suppressed.

stephanie-shopify and others added 3 commits June 29, 2026 12:16
…r, suppression naming

Responds to Mbarak's review on #7901:
- Scan every GraphQL error (not just errors[0]) for routing codes, recognizing
  THROTTLED/429 and ACCESS_DENIED incl. nested App Management app_errors
  access_denied. Extracted into a dependency-free graphql-error-codes module
  shared by grouping and crash-report suppression (avoids the error.ts ->
  headers.ts -> error.ts import cycle).
- Add the string GraphQL code "429" as a rate-limit signal, matching
  errorsIncludeStatus429 in private/node/api.ts.
- Merge object signals over message-derived signals field-by-field with `??`
  so an absent object code/status can no longer erase a recovered value.
- Single resolveErrorGrouping() returns {hash, category, signals}; the reporter
  drives both event.groupingHash and the error_grouping metadata from it, so the
  metadata now carries category and message-derived signals.
- Rename isTransientApiError -> isExpectedApiError and document why 401
  (expired/invalid credentials) is expected, distinct from rate limiting.
- Document the deliberate 401-over-ACCESS_DENIED precedence and pin it + the new
  shapes with tests. Remove the issue-specific DESIGN.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wrap the two resolveErrorGrouping return objects across lines (prettier/prettier)
and flatten the isExpectedApiError jsdoc to remove indented continuation lines
(jsdoc/check-indentation, enforced on public API files).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…TH_CONDITION_TYPES

Incidental codegen drift cleanup, not part of the error-grouping logic. The
admin schema gained this OnlineStoreThemeFilesUserErrorsCode enum value upstream;
the committed generated types were stale. The codegen check only runs when a PR
touches graphql-pathed files, so this PR's new graphql-error-codes.ts un-skipped
it and surfaced the pre-existing drift. Reproduces codegen output byte-for-byte
(no local schema fetch available).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Differences in type declarations

We detected differences in the type declarations generated by Typescript for this branch compared to the baseline ('main' branch). Please, review them to ensure they are backward-compatible. Here are some important things to keep in mind:

  • Some seemingly private modules might be re-exported through public modules.
  • If the branch is behind main you might see odd diffs, rebase main into this branch.

New type declarations

packages/cli-kit/dist/private/node/analytics/error-grouping.d.ts
/**
 * Structured signals extracted from an error, used to group it into a meaningful Bugsnag bucket.
 *
 * These are read from the *original* typed error (before it is flattened to a generic `Error`
 * for reporting), so we group on facts the error already carries rather than by re-parsing a
 * stringified message.
 */
interface ErrorGroupingSignals {
    httpStatus?: number;
    code?: string;
    errorClass?: string;
}
/**
 * The fully resolved grouping decision for an error: the Bugsnag grouping hash (or `undefined` to
 * fall back to stack-trace grouping), the semantic category, and the structured signals that drove
 * the decision. The reporter uses a single resolution for both `event.groupingHash` and the
 * `error_grouping` metadata so they can never disagree.
 */
interface ResolvedErrorGrouping {
    hash?: string;
    category?: string;
    signals: ErrorGroupingSignals;
}
/**
 * Extracts structured grouping signals from an error object.
 *
 * Only assigns a field when a value is actually present, so the result can be safely merged over
 * message-derived signals without an explicit `undefined` erasing a recovered value.
 *
 * @param error - The original error (any type).
 * @returns The HTTP status, GraphQL error code, and error class name when available.
 */
export declare function errorGroupingSignals(error: unknown): ErrorGroupingSignals;
/**
 * Resolves the full grouping decision for an error: hash, category, and the signals behind it.
 *
 * Categories are resolved structured-first (HTTP status / GraphQL code), falling back to the
 * shared keyword categorizer only for untyped errors. When no meaningful category can be
 * resolved, `hash` is `undefined` so the caller leaves `event.groupingHash` unset and Bugsnag's
 * default stack-trace grouping applies — this avoids merging genuinely distinct unknown bugs.
 *
 * @param error - The original error (any type).
 * @param sliceName - The product slice (`app`, `theme`, `store`, `hydrogen`, or `cli`).
 * @returns The resolved hash (or `undefined`), category, and structured signals.
 */
export declare function resolveErrorGrouping(error: unknown, sliceName: string): ResolvedErrorGrouping;
/**
 * Builds a Bugsnag grouping hash of the form `${slice}:${category}:${signature}`.
 *
 * Thin wrapper over {@link resolveErrorGrouping} for callers that only need the hash.
 *
 * @param error - The original error (any type).
 * @param sliceName - The product slice (`app`, `theme`, `store`, `hydrogen`, or `cli`).
 * @returns The grouping hash, or `undefined` to fall back to stack-trace grouping.
 */
export declare function errorGroupingHash(error: unknown, sliceName: string): string | undefined;
export {};
packages/cli-kit/dist/private/node/analytics/graphql-error-codes.d.ts
/**
 * Pure helpers for reading routing-relevant codes out of a GraphQL error response.
 *
 * This module is intentionally dependency-free: it is imported by both the grouping logic
 * (`error-grouping.ts`) and the crash-report suppression logic (`../../public/node/error.ts`).
 * `error.ts` cannot import `error-grouping.ts` directly — that would create an
 * `error.ts → headers.ts → error.ts` import cycle — so the shared scanning logic lives here,
 * where it imports nothing from cli-kit.
 */
/**
 * Collects every routing-relevant code from a GraphQL `errors` value.
 *
 * Scans *all* errors (not just the first), reading both the top-level `extensions.code` and the
 * nested App Management shape `extensions.app_errors.errors[].category`. The API can also return
 * `errors` as a string (e.g. some 401s), which yields no codes.
 *
 * @param errors - The `errors` value from a GraphQL response (array, string, or undefined).
 * @returns Every string code/category found, in document order.
 */
export declare function graphQLErrorCodes(errors: unknown): string[];
/**
 * Whether a single code is a rate-limit signal (`THROTTLED` or `429`).
 *
 * Mirrors the established shape detected by `errorsIncludeStatus429` in `private/node/api.ts`,
 * where `extensions.code === '429'` signals rate limiting even at HTTP 200.
 */
export declare function isRateLimitCode(code: string | undefined): boolean;
/**
 * Whether a single code/category is a permission / access-denied signal.
 */
export declare function isPermissionCode(code: string | undefined): boolean;
/**
 * Whether any GraphQL error carries a rate-limit code. Convenience for suppression logic that only
 * needs a boolean over the raw `errors` value.
 *
 * @param errors - The `errors` value from a GraphQL response.
 * @returns True when a rate-limit code is present in any error.
 */
export declare function hasRateLimitCode(errors: unknown): boolean;

Existing type declarations

We found no diffs with existing type declarations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: @shopify/cli @shopify/cli package issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

split observe analytics group for cleaner resiliency tracking

2 participants