fix: address review findings (categorical detection, bitset safety, API) by jxucoder · Pull Request #18 · jxucoder/openboost

jxucoder · 2026-06-27T21:48:13Z

Summary

Fixes the issues surfaced in code review of the recent feat: 12 UX improvements and fix: correctness/safety commits:

Categorical auto-detection bug: dtype is object is always False for pandas/numpy object columns (np.dtype('O') is object → False), so string columns were silently not detected. Now tested via dtype.kind in ('O','U','S') (+ categories for pandas Categorical). Numpy object and string arrays now work.
clear_tree_workspace_cache() unreachable: was defined only in _backends/_cuda.py and never exported (and unsafe to import on CPU-only installs). Added a backend-dispatching wrapper with a CPU no-op and exported it as openboost.clear_tree_workspace_cache.
Categorical bitset ≥64 safety: the prior fix guarded only the CPU partition path. Confirmed the GPU partition kernel is ordinal-only (no bitset), then capped the remaining 5 1 << cat / bitset >> bin sites — both CPU+GPU split-finders skip bits ≥64 and all three predict paths route bins ≥64 right. Categories with bin ≥64 now consistently go right everywhere (documented).
backend_context docs: clarified it mutates a process-global (not thread-local) backend; documented the intentional None→auto-detect restore.
DRY: extracted warn_if_early_stopping_without_eval_set() into _callbacks.py, removing 5 duplicated warning blocks across GradientBoosting / MultiClass / DART / NaturalBoost.

Test plan

ruff check src/openboost/ — no new errors (2 remaining SIM108 are pre-existing in unrelated CUDA kernels)
Full CPU suite: OPENBOOST_BACKEND=cpu uv run pytest tests/ → 511 passed, 25 skipped
Smoke: object-array categorical auto-detect, numeric arrays unaffected, clear_tree_workspace_cache() CPU no-op, EarlyStopping-without-eval_set warning fires
100-category (>64) feature trains + predicts deterministically/consistently
py_compile on all edited files (syntax-checks the @cuda.jit kernels, which can't JIT without a GPU)

Made with Cursor

- Fix categorical auto-detection: test dtype.kind instead of `is object` (np.dtype('O') is object is False), so pandas/numpy string columns are actually detected - Export clear_tree_workspace_cache() via backend dispatch with a CPU no-op (was defined only in _cuda.py and unreachable / import-unsafe on CPU) - Cap categorical bitset ops at 64 bits consistently across CPU+GPU split-finding and all predict paths (bins >= 64 always route right) - Document backend_context as a process-global (not thread-local) switch - Extract shared warn_if_early_stopping_without_eval_set() helper, removing 5 duplicated warning blocks across GradientBoosting/MultiClass/DART/NaturalBoost Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6eb46e8eef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-27T21:51:15Z

+        if cat < 64:
+            cat_bitset |= (1 << cat)


Exclude high categories before scoring categorical splits

When a categorical feature has 65+ categories and the best Fisher split places any category bin >=64 on the left, this new guard silently drops those bins from the stored bitset even though _find_best_categorical_split already included their gradients/hessians in left_g/left_h when computing best_gain and min_child_weight. Because partitioning/prediction route those bins right, the accepted split can violate the scored child weights or create an empty/wrong left child, silently corrupting training for high-cardinality categorical features instead of representing the scored split. The CUDA split finder mirrors the same pattern.

Useful? React with 👍 / 👎.

jxucoder merged commit 30013b1 into main Jun 27, 2026
0 of 4 checks passed

jxucoder deleted the fix/review-findings branch June 27, 2026 21:48

chatgpt-codex-connector Bot reviewed Jun 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: address review findings (categorical detection, bitset safety, API)#18

fix: address review findings (categorical detection, bitset safety, API)#18
jxucoder merged 1 commit into
mainfrom
fix/review-findings

jxucoder commented Jun 27, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jxucoder commented Jun 27, 2026

Summary

Test plan

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant