fix: address review findings (categorical detection, bitset safety, API)#18
Conversation
- Fix categorical auto-detection: test dtype.kind instead of `is object`
(np.dtype('O') is object is False), so pandas/numpy string columns are
actually detected
- Export clear_tree_workspace_cache() via backend dispatch with a CPU no-op
(was defined only in _cuda.py and unreachable / import-unsafe on CPU)
- Cap categorical bitset ops at 64 bits consistently across CPU+GPU
split-finding and all predict paths (bins >= 64 always route right)
- Document backend_context as a process-global (not thread-local) switch
- Extract shared warn_if_early_stopping_without_eval_set() helper, removing
5 duplicated warning blocks across GradientBoosting/MultiClass/DART/NaturalBoost
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6eb46e8eef
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if cat < 64: | ||
| cat_bitset |= (1 << cat) |
There was a problem hiding this comment.
Exclude high categories before scoring categorical splits
When a categorical feature has 65+ categories and the best Fisher split places any category bin >=64 on the left, this new guard silently drops those bins from the stored bitset even though _find_best_categorical_split already included their gradients/hessians in left_g/left_h when computing best_gain and min_child_weight. Because partitioning/prediction route those bins right, the accepted split can violate the scored child weights or create an empty/wrong left child, silently corrupting training for high-cardinality categorical features instead of representing the scored split. The CUDA split finder mirrors the same pattern.
Useful? React with 👍 / 👎.
Summary
Fixes the issues surfaced in code review of the recent
feat: 12 UX improvementsandfix: correctness/safetycommits:dtype is objectis alwaysFalsefor pandas/numpy object columns (np.dtype('O') is object→False), so string columns were silently not detected. Now tested viadtype.kind in ('O','U','S')(+categoriesfor pandas Categorical). Numpy object and string arrays now work.clear_tree_workspace_cache()unreachable: was defined only in_backends/_cuda.pyand never exported (and unsafe to import on CPU-only installs). Added a backend-dispatching wrapper with a CPU no-op and exported it asopenboost.clear_tree_workspace_cache.1 << cat/bitset >> binsites — both CPU+GPU split-finders skip bits ≥64 and all three predict paths route bins ≥64 right. Categories with bin ≥64 now consistently go right everywhere (documented).backend_contextdocs: clarified it mutates a process-global (not thread-local) backend; documented the intentionalNone→auto-detect restore.warn_if_early_stopping_without_eval_set()into_callbacks.py, removing 5 duplicated warning blocks across GradientBoosting / MultiClass / DART / NaturalBoost.Test plan
ruff check src/openboost/— no new errors (2 remainingSIM108are pre-existing in unrelated CUDA kernels)OPENBOOST_BACKEND=cpu uv run pytest tests/→ 511 passed, 25 skippedclear_tree_workspace_cache()CPU no-op, EarlyStopping-without-eval_set warning firespy_compileon all edited files (syntax-checks the@cuda.jitkernels, which can't JIT without a GPU)Made with Cursor