Skip to content

fix: address review findings (categorical detection, bitset safety, API)#18

Merged
jxucoder merged 1 commit into
mainfrom
fix/review-findings
Jun 27, 2026
Merged

fix: address review findings (categorical detection, bitset safety, API)#18
jxucoder merged 1 commit into
mainfrom
fix/review-findings

Conversation

@jxucoder

Copy link
Copy Markdown
Owner

Summary

Fixes the issues surfaced in code review of the recent feat: 12 UX improvements and fix: correctness/safety commits:

  • Categorical auto-detection bug: dtype is object is always False for pandas/numpy object columns (np.dtype('O') is objectFalse), so string columns were silently not detected. Now tested via dtype.kind in ('O','U','S') (+ categories for pandas Categorical). Numpy object and string arrays now work.
  • clear_tree_workspace_cache() unreachable: was defined only in _backends/_cuda.py and never exported (and unsafe to import on CPU-only installs). Added a backend-dispatching wrapper with a CPU no-op and exported it as openboost.clear_tree_workspace_cache.
  • Categorical bitset ≥64 safety: the prior fix guarded only the CPU partition path. Confirmed the GPU partition kernel is ordinal-only (no bitset), then capped the remaining 5 1 << cat / bitset >> bin sites — both CPU+GPU split-finders skip bits ≥64 and all three predict paths route bins ≥64 right. Categories with bin ≥64 now consistently go right everywhere (documented).
  • backend_context docs: clarified it mutates a process-global (not thread-local) backend; documented the intentional None→auto-detect restore.
  • DRY: extracted warn_if_early_stopping_without_eval_set() into _callbacks.py, removing 5 duplicated warning blocks across GradientBoosting / MultiClass / DART / NaturalBoost.

Test plan

  • ruff check src/openboost/ — no new errors (2 remaining SIM108 are pre-existing in unrelated CUDA kernels)
  • Full CPU suite: OPENBOOST_BACKEND=cpu uv run pytest tests/511 passed, 25 skipped
  • Smoke: object-array categorical auto-detect, numeric arrays unaffected, clear_tree_workspace_cache() CPU no-op, EarlyStopping-without-eval_set warning fires
  • 100-category (>64) feature trains + predicts deterministically/consistently
  • py_compile on all edited files (syntax-checks the @cuda.jit kernels, which can't JIT without a GPU)

Made with Cursor

- Fix categorical auto-detection: test dtype.kind instead of `is object`
  (np.dtype('O') is object is False), so pandas/numpy string columns are
  actually detected
- Export clear_tree_workspace_cache() via backend dispatch with a CPU no-op
  (was defined only in _cuda.py and unreachable / import-unsafe on CPU)
- Cap categorical bitset ops at 64 bits consistently across CPU+GPU
  split-finding and all predict paths (bins >= 64 always route right)
- Document backend_context as a process-global (not thread-local) switch
- Extract shared warn_if_early_stopping_without_eval_set() helper, removing
  5 duplicated warning blocks across GradientBoosting/MultiClass/DART/NaturalBoost

Co-authored-by: Cursor <cursoragent@cursor.com>
@jxucoder jxucoder merged commit 30013b1 into main Jun 27, 2026
0 of 4 checks passed
@jxucoder jxucoder deleted the fix/review-findings branch June 27, 2026 21:48

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6eb46e8eef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +565 to +566
if cat < 64:
cat_bitset |= (1 << cat)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Exclude high categories before scoring categorical splits

When a categorical feature has 65+ categories and the best Fisher split places any category bin >=64 on the left, this new guard silently drops those bins from the stored bitset even though _find_best_categorical_split already included their gradients/hessians in left_g/left_h when computing best_gain and min_child_weight. Because partitioning/prediction route those bins right, the accepted split can violate the scored child weights or create an empty/wrong left child, silently corrupting training for high-cardinality categorical features instead of representing the scored split. The CUDA split finder mirrors the same pattern.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant