Hello, thank you for sharing this interesting work.
I had a question about the experimental setup in Table 1, specifically the rows under the Claude Code harness. The table lists the model as GPT-5.5, while the harness is described as Claude Code.
My understanding is that the paper separates the target model from the execution harness:
model: the frozen target model that solves the task
harness: the execution environment / adapter used to run the task
In that interpretation, the table would mean that GPT-5.5 was evaluated inside a Claude Code-style harness.
However, Section 4 also describes the Claude Code harness as mirroring the workspace contract through the claude CLI. Since Claude Code is usually understood as an Anthropic/Claude-based coding agent, I was unsure how GPT-5.5 is used in this setting.
Could you clarify which interpretation is correct?
- Does “Claude Code harness + GPT-5.5” mean that the Claude Code harness is used only as an execution interface, while the underlying target model is GPT-5.5?
- Or is the table meant to report results from Claude Code’s native model, in which case “GPT-5.5” might be a typo?
- If GPT-5.5 was indeed used through the Claude Code harness, could you briefly explain how the model was connected to the
claude CLI / harness adapter?
This would be helpful for correctly interpreting the cross-harness results and for avoiding confusion between the model and the execution environment.
Thank you!
Hello, thank you for sharing this interesting work.
I had a question about the experimental setup in Table 1, specifically the rows under the Claude Code harness. The table lists the model as GPT-5.5, while the harness is described as Claude Code.
My understanding is that the paper separates the target model from the execution harness:
model: the frozen target model that solves the taskharness: the execution environment / adapter used to run the taskIn that interpretation, the table would mean that GPT-5.5 was evaluated inside a Claude Code-style harness.
However, Section 4 also describes the Claude Code harness as mirroring the workspace contract through the
claudeCLI. Since Claude Code is usually understood as an Anthropic/Claude-based coding agent, I was unsure how GPT-5.5 is used in this setting.Could you clarify which interpretation is correct?
claudeCLI / harness adapter?This would be helpful for correctly interpreting the cross-harness results and for avoiding confusion between the model and the execution environment.
Thank you!