Skip to content

fix(eval): fix search_tool correctness always scoring 0%#1675

Open
Tomkess wants to merge 1 commit into
masterfrom
fix/eval-search-tool-correctness
Open

fix(eval): fix search_tool correctness always scoring 0%#1675
Tomkess wants to merge 1 commit into
masterfrom
fix/eval-search-tool-correctness

Conversation

@Tomkess

@Tomkess Tomkess commented Jun 27, 2026

Copy link
Copy Markdown

Summary

The tool_correctness metric in search_tool evaluations was always False (0%), making the dashboard metric useless.

Root cause: _args_match checked two fields that never matched real model calls:

  • emit_widget — this parameter was renamed to user_requested_search in the tool schema; the model never sends emit_widget, so actual_args.get("emit_widget") was always None vs the expected false
  • limit — declared as optional (int | None) in the schema; the model omits it and relies on the server default, while fixtures hardcode 10

Fix: Only check keywords (case-insensitive) and object_types — the two fields that determine whether the search was semantically correct. limit and the widget display flag are not part of search correctness.

Test Plan

  • Run uv run pytest tests/ -x -q in packages/gooddata-eval
  • Re-run search_tool eval items and verify tool_correctness now reflects actual model behaviour (models that call with the right keywords and object types should score True)

Risk

Low — single function change in the evaluator, no effect on eval runner or production code.

Summary by CodeRabbit

  • Bug Fixes
    • Improved search tool call matching so results are judged by the most relevant search inputs.
    • Search keywords are now compared without case sensitivity, making matching more consistent.
    • Minor options no longer need to match exactly when evaluating search quality, reducing false mismatches.

_args_match checked emit_widget (renamed to user_requested_search in the
tool schema) and limit (optional, server-side default). Both mismatched on
every real model call, so tool_correctness was always False regardless of
whether the model used the right keywords and object types.

Fix: evaluate only keywords (case-insensitive) and object_types — the two
fields that actually determine whether the search was semantically correct.
@Tomkess Tomkess requested review from hkad98, lupko and pcerny as code owners June 27, 2026 20:34
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

_args_match in the search tool evaluator is updated to compare only keywords (case-insensitively) and object_types when determining semantic correctness, dropping the previous equality checks on limit and emit_widget.

Search Tool Argument Matching

Layer / File(s) Summary
Relax _args_match criteria
packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py
Removes limit and emit_widget equality checks; converts keywords to lowercase before sorting for case-insensitive comparison.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 A keyword is a keyword, big or small,
No matter the case, I'll find them all!
Forget the limit, forget the widget too,
Just match the words — that's the thing to do.
Hopping through logic, clean and bright! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: fixing search_tool evaluation correctness that was always scoring 0%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`:
- Around line 12-16: The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e3e175f-7f97-436d-9802-bd0a292bd4bb

📥 Commits

Reviewing files that changed from the base of the PR and between 726a0b0 and 2ebc8b0.

📒 Files selected for processing (1)
  • packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

Comment on lines +12 to +16
actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or []))
expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or []))
if actual_kw != expected_kw:
return False
if sorted(actual_args.get("object_types") or []) != sorted(expected_args.get("object_types") or []):
return False
if actual_args.get("limit") != expected_args.get("limit"):
return False
return actual_args.get("emit_widget") == expected_args.get("emit_widget")
return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or [])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Harden _args_match against malformed JSON argument types.

parsed_arguments() returns raw model-emitted JSON, so a bad tool call like {"keywords":[1]} or mixed object_types will raise here (.lower() / sorted(...)) and abort evaluation instead of producing tool_correctness=False. Treat non-string entries as invalid input and normalize defensively.

Proposed fix
 def _args_match(actual_args: dict, expected_args: dict) -> bool:
     # Only keywords and object_types determine semantic correctness.
     # limit is optional with a server-side default; emit_widget was renamed to
     # user_requested_search in the tool schema — neither affects search quality.
-    actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or []))
-    expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or []))
+    def _normalize_str_list(value: object, *, lowercase: bool = False) -> list[str]:
+        if not isinstance(value, list):
+            return []
+        items = [item for item in value if isinstance(item, str)]
+        return sorted(item.lower() if lowercase else item for item in items)
+
+    actual_kw = _normalize_str_list(actual_args.get("keywords"), lowercase=True)
+    expected_kw = _normalize_str_list(expected_args.get("keywords"), lowercase=True)
     if actual_kw != expected_kw:
         return False
-    return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or [])
+    return _normalize_str_list(actual_args.get("object_types")) == _normalize_str_list(
+        expected_args.get("object_types")
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or []))
expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or []))
if actual_kw != expected_kw:
return False
if sorted(actual_args.get("object_types") or []) != sorted(expected_args.get("object_types") or []):
return False
if actual_args.get("limit") != expected_args.get("limit"):
return False
return actual_args.get("emit_widget") == expected_args.get("emit_widget")
return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or [])
def _normalize_str_list(value: object, *, lowercase: bool = False) -> list[str]:
if not isinstance(value, list):
return []
items = [item for item in value if isinstance(item, str)]
return sorted(item.lower() if lowercase else item for item in items)
actual_kw = _normalize_str_list(actual_args.get("keywords"), lowercase=True)
expected_kw = _normalize_str_list(expected_args.get("keywords"), lowercase=True)
if actual_kw != expected_kw:
return False
return _normalize_str_list(actual_args.get("object_types")) == _normalize_str_list(
expected_args.get("object_types")
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`
around lines 12 - 16, The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.

@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.21%. Comparing base (653f5bc) to head (2ebc8b0).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1675   +/-   ##
=======================================
  Coverage   79.21%   79.21%           
=======================================
  Files         232      232           
  Lines       15809    15809           
=======================================
  Hits        12523    12523           
  Misses       3286     3286           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant