fix(eval): fix search_tool correctness always scoring 0% by Tomkess · Pull Request #1675 · gooddata/gooddata-python-sdk

Tomkess · 2026-06-27T20:34:01Z

Summary

The tool_correctness metric in search_tool evaluations was always False (0%), making the dashboard metric useless.

Root cause: _args_match checked two fields that never matched real model calls:

emit_widget — this parameter was renamed to user_requested_search in the tool schema; the model never sends emit_widget, so actual_args.get("emit_widget") was always None vs the expected false
limit — declared as optional (int | None) in the schema; the model omits it and relies on the server default, while fixtures hardcode 10

Fix: Only check keywords (case-insensitive) and object_types — the two fields that determine whether the search was semantically correct. limit and the widget display flag are not part of search correctness.

Test Plan

Run uv run pytest tests/ -x -q in packages/gooddata-eval
Re-run search_tool eval items and verify tool_correctness now reflects actual model behaviour (models that call with the right keywords and object types should score True)

Risk

Low — single function change in the evaluator, no effect on eval runner or production code.

Summary by CodeRabbit

Bug Fixes
- Improved search tool call matching so results are judged by the most relevant search inputs.
- Search keywords are now compared without case sensitivity, making matching more consistent.
- Minor options no longer need to match exactly when evaluating search quality, reducing false mismatches.

_args_match checked emit_widget (renamed to user_requested_search in the tool schema) and limit (optional, server-side default). Both mismatched on every real model call, so tool_correctness was always False regardless of whether the model used the right keywords and object types. Fix: evaluate only keywords (case-insensitive) and object_types — the two fields that actually determine whether the search was semantically correct.

coderabbitai · 2026-06-27T20:34:19Z

📝 Walkthrough

Walkthrough

_args_match in the search tool evaluator is updated to compare only keywords (case-insensitively) and object_types when determining semantic correctness, dropping the previous equality checks on limit and emit_widget.

Search Tool Argument Matching

Layer / File(s)	Summary
Relax `_args_match` criteria `packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`	Removes `limit` and `emit_widget` equality checks; converts `keywords` to lowercase before sorting for case-insensitive comparison.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 A keyword is a keyword, big or small,
No matter the case, I'll find them all!
Forget the limit, forget the widget too,
Just match the words — that's the thing to do.
Hopping through logic, clean and bright! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main change: fixing search_tool evaluation correctness that was always scoring 0%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`:
- Around line 12-16: The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e3e175f-7f97-436d-9802-bd0a292bd4bb

📥 Commits

Reviewing files that changed from the base of the PR and between 726a0b0 and 2ebc8b0.

📒 Files selected for processing (1)

packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

coderabbitai · 2026-06-27T20:36:03Z

+    actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or []))
+    expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or []))
+    if actual_kw != expected_kw:
        return False
-    if sorted(actual_args.get("object_types") or []) != sorted(expected_args.get("object_types") or []):
-        return False
-    if actual_args.get("limit") != expected_args.get("limit"):
-        return False
-    return actual_args.get("emit_widget") == expected_args.get("emit_widget")
+    return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or [])


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Harden _args_match against malformed JSON argument types.

parsed_arguments() returns raw model-emitted JSON, so a bad tool call like {"keywords":[1]} or mixed object_types will raise here (.lower() / sorted(...)) and abort evaluation instead of producing tool_correctness=False. Treat non-string entries as invalid input and normalize defensively.

Proposed fix

def _args_match(actual_args: dict, expected_args: dict) -> bool: # Only keywords and object_types determine semantic correctness. # limit is optional with a server-side default; emit_widget was renamed to # user_requested_search in the tool schema — neither affects search quality. - actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or [])) - expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or [])) + def _normalize_str_list(value: object, *, lowercase: bool = False) -> list[str]: + if not isinstance(value, list): + return [] + items = [item for item in value if isinstance(item, str)] + return sorted(item.lower() if lowercase else item for item in items) + + actual_kw = _normalize_str_list(actual_args.get("keywords"), lowercase=True) + expected_kw = _normalize_str_list(expected_args.get("keywords"), lowercase=True) if actual_kw != expected_kw: return False - return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or []) + return _normalize_str_list(actual_args.get("object_types")) == _normalize_str_list( + expected_args.get("object_types") + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or []))

expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or []))

if actual_kw != expected_kw:

return False

if sorted(actual_args.get("object_types") or []) != sorted(expected_args.get("object_types") or []):

return False

if actual_args.get("limit") != expected_args.get("limit"):

return False

return actual_args.get("emit_widget") == expected_args.get("emit_widget")

return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or [])

def _normalize_str_list(value: object, *, lowercase: bool = False) -> list[str]:

if not isinstance(value, list):

return []

items = [item for item in value if isinstance(item, str)]

return sorted(item.lower() if lowercase else item for item in items)

actual_kw = _normalize_str_list(actual_args.get("keywords"), lowercase=True)

expected_kw = _normalize_str_list(expected_args.get("keywords"), lowercase=True)

if actual_kw != expected_kw:

return False

return _normalize_str_list(actual_args.get("object_types")) == _normalize_str_list(

expected_args.get("object_types")

)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py` around lines 12 - 16, The _args_match comparison is not defensive enough and can crash on malformed tool arguments from parsed_arguments(). Update _args_match to validate and normalize actual_args["keywords"] and actual_args["object_types"] before lowercasing or sorting, treating any non-string or unexpected entry as a mismatch instead of raising. Keep the existing matching behavior for valid inputs, but ensure bad JSON like mixed types or numeric keywords returns False rather than aborting evaluation.

codecov · 2026-06-27T20:37:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.21%. Comparing base (653f5bc) to head (2ebc8b0).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1675   +/-   ##
=======================================
  Coverage   79.21%   79.21%           
=======================================
  Files         232      232           
  Lines       15809    15809           
=======================================
  Hits        12523    12523           
  Misses       3286     3286

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Tomkess requested review from hkad98, lupko and pcerny as code owners June 27, 2026 20:34

coderabbitai Bot reviewed Jun 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(eval): fix search_tool correctness always scoring 0%#1675

fix(eval): fix search_tool correctness always scoring 0%#1675
Tomkess wants to merge 1 commit into
masterfrom
fix/eval-search-tool-correctness

Tomkess commented Jun 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 27, 2026

Uh oh!

codecov Bot commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Tomkess commented Jun 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Risk

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 27, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Tomkess commented Jun 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading