Extend the Message format to support tool calling, extended thinking, and multimodality

This issue aims to bring together several related discussions (including #138, #149, and #50) into one cohesive proposal around restructuring the `Message` type to better reflect the realities of modern on-device LLMs.

## Motivation
 
Even sub-10B models increasingly support multimodal inputs, tool calling, and extended thinking. The current `Message` shape (a role plus a string `content`) has no room for any of these, which makes it hard to build real agent-style applications on top of the Prompt API. Rather than solving each capability in isolation, it seems worth discussing a more unified message structure that could accommodate all of them cleanly.

## Proposed message structure
 
The key idea is to make `content` either a plain string or an array of typed content parts (text, image, audio), and to introduce a distinct message interface per role, since the properties that make sense differ significantly between `system`, `user`, `assistant`, and `tool` messages:

```typescript
export type MessageContent = string | MessageContentPart[];

export type MessageContentPart = TextContentPart | ImageContentPart | AudioContentPart;

export interface TextContentPart {
    type: 'text';
    text: string;
}

export interface ImageContentPart {
    type: 'image';
    data: string | Blob | ArrayBuffer | Uint8Array;
    mimeType?: 'image/png' | 'image/jpeg' | 'image/webp' | string;
}

export interface AudioContentPart {
    type: 'audio';
    data: string | Blob | ArrayBuffer | Uint8Array;
    mimeType?: 'audio/wav' | 'audio/mpeg' | 'audio/ogg' | string;
}

export interface SystemMessage {
    role: 'system';
    content: MessageContent;
}

export interface UserMessage {
    role: 'user';
    content: MessageContent;
}

export interface AssistantMessage {
    role: 'assistant';
    content?: MessageContent;
    thinking?: string;
    toolCalls?: Array<{
        id: string;
        type?: 'function';
        function: {
            name: string;
            arguments: Record<string, unknown>;
        };
    }>;
}

export interface ToolMessage {
    role: 'tool';
    toolCallId: string;
    name?: string;
    content: MessageContent;
}

export type Message = SystemMessage | UserMessage | AssistantMessage | ToolMessage;
```

Note that `MessageContent` is shared by every role, including tool call responses, which keeps the content shape consistent regardless of role.
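As a rough illustration of how the per-role split plays out in practice, the sketch below narrows on `role` and flattens `MessageContent` to text. The types are trimmed copies of the ones above (image `data` is reduced to `string | Uint8Array` and audio is omitted) so the snippet is self-contained. `contentToText` and `pendingToolCalls` are hypothetical helpers for illustration, not part of the proposal.

```typescript
// Trimmed copies of the proposed types so the sketch is self-contained.
type TextContentPart = { type: 'text'; text: string };
type ImageContentPart = { type: 'image'; data: string | Uint8Array; mimeType?: string };
type MessageContentPart = TextContentPart | ImageContentPart;
type MessageContent = string | MessageContentPart[];

type ToolCall = { id: string; function: { name: string; arguments: Record<string, unknown> } };

type Message =
  | { role: 'system'; content: MessageContent }
  | { role: 'user'; content: MessageContent }
  | { role: 'assistant'; content?: MessageContent; thinking?: string; toolCalls?: ToolCall[] }
  | { role: 'tool'; toolCallId: string; name?: string; content: MessageContent };

// Hypothetical helper: collapse any MessageContent into plain text,
// skipping non-text parts.
function contentToText(content: MessageContent | undefined): string {
  if (content === undefined) return '';
  if (typeof content === 'string') return content;
  return content
    .filter((part): part is TextContentPart => part.type === 'text')
    .map((part) => part.text)
    .join('');
}

// Hypothetical helper: list tool calls that have no matching `tool` message yet.
function pendingToolCalls(messages: Message[]): ToolCall[] {
  const answered = new Set<string>();
  for (const m of messages) {
    // Narrowing on `role` makes `toolCallId` available here.
    if (m.role === 'tool') answered.add(m.toolCallId);
  }
  const pending: ToolCall[] = [];
  for (const m of messages) {
    if (m.role !== 'assistant') continue;
    for (const call of m.toolCalls ?? []) {
      if (!answered.has(call.id)) pending.push(call);
    }
  }
  return pending;
}

const userMessage: Message = {
  role: 'user',
  content: [{ type: 'text', text: "What's the weather in London?" }],
};

const conversation: Message[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  userMessage,
  {
    role: 'assistant',
    toolCalls: [{ id: 'call_1', function: { name: 'get_weather', arguments: { location: 'London' } } }],
  },
];
```

Because `role` is a literal on every interface, the union is discriminated: checking `m.role === 'tool'` is enough for the compiler to know that `toolCallId` exists, and no casts are needed.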
 
Because messages are fully serializable, they can also be stored and reused across sessions via `initialPrompts`, which would address #50.
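A minimal sketch of that round trip, assuming plain JSON as the storage format. It also assumes binary image/audio data is kept as a string (for example base64 or a data URL), since `Blob`, `ArrayBuffer`, and `Uint8Array` values do not survive `JSON.stringify`. `StoredMessage`, `saveHistory`, and `loadHistory` are hypothetical names used only for this example.

```typescript
// Trimmed, JSON-friendly message type for the sketch (assumption: binary
// data is already encoded as a string).
type StoredPart =
  | { type: 'text'; text: string }
  | { type: 'image'; data: string; mimeType?: string };

type StoredMessage = {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content?: string | StoredPart[];
  toolCallId?: string;
};

// Hypothetical helper: serialize a conversation for storage.
function saveHistory(messages: StoredMessage[]): string {
  return JSON.stringify(messages);
}

// Hypothetical helper: restore a conversation, with a minimal shape check.
function loadHistory(serialized: string): StoredMessage[] {
  const parsed: unknown = JSON.parse(serialized);
  if (!Array.isArray(parsed)) {
    throw new TypeError('Expected an array of messages');
  }
  return parsed as StoredMessage[];
}

const history: StoredMessage[] = [
  { role: 'system', content: 'You are a helpful research assistant.' },
  {
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      // Placeholder payload: base64 of the 8-byte PNG signature only.
      { type: 'image', data: 'iVBORw0KGgo=', mimeType: 'image/png' },
    ],
  },
];

const restored = loadHistory(saveHistory(history));
// `restored` could then be passed as `initialPrompts` when creating a new session.
```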
 
## Proposed response structure
 
On the output side, a flat text response also doesn't map well to agentic interactions. A single user prompt can trigger multiple model turns: one or more to emit tool calls, and another to synthesize the result. A more flexible response type, called `RequestResult` here, could model this as a list of `RunResult`s:

```typescript
export interface Usage {
    promptTokens: number;
    completionTokens: number;
    totalTokens: number;
}

export interface RunResult {
    thinkingText: string;
    text: string;
    tools: ToolCallResult[];
    usage: Usage;
}

export interface RequestResult {
    done: boolean;
    runs: RunResult[];
    usage: Usage;
}

export type StreamChunk = RequestResult;
```

A response to "What's the weather in London?" with tool calling and thinking enabled might look like:

```javascript
const response = {
  done: true,
  runs: [
    {
      thinkingText:
        '1.  **Analyze the user\'s request:** The user is asking ...',
      text: "",
      tools: [
        {
          id: "toolcall_3",
          name: "get_weather",
          args: {
            location: "London",
          },
          output: {
            content: [
              {
                type: "text",
                text: "The weather in London is sunny, 20 degrees Celsius.",
              },
            ],
          },
          durationMs: 0.13499999791383743,
        },
      ],
      usage: {
        promptTokens: 123,
        completionTokens: 156,
        totalTokens: 279,
      },
    },
    {
      thinkingText:
        'The user is asking for the current weather in ...',
      text: "The current weather in London is sunny with a temperature of 20 degrees.",
      tools: [],
      usage: {
        promptTokens: 167,
        completionTokens: 152,
        totalTokens: 319,
      },
    },
  ],
  usage: {
    promptTokens: 290,
    completionTokens: 308,
    totalTokens: 598,
  },
};
```

The response contains two runs: the first produces the tool call, and the second synthesizes the final answer. The top-level `usage` is the sum of the per-run values.
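`RunResult` references a `ToolCallResult` type that isn't defined in the interfaces above. The sketch below infers one possible shape from the example (this is an assumption, not part of the proposal) and shows how request-level usage and the final answer could be derived from the runs. `sumUsage` and `finalText` are hypothetical helpers.

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

// Assumed shape, inferred from the example above; not defined in the proposal.
interface ToolCallResult {
  id: string;
  name: string;
  args: Record<string, unknown>;
  output: { content: Array<{ type: 'text'; text: string }> };
  durationMs: number;
}

interface RunResult {
  thinkingText: string;
  text: string;
  tools: ToolCallResult[];
  usage: Usage;
}

// Hypothetical helper: sum per-run usage into a request-level total.
function sumUsage(runs: RunResult[]): Usage {
  return runs.reduce<Usage>(
    (acc, run) => ({
      promptTokens: acc.promptTokens + run.usage.promptTokens,
      completionTokens: acc.completionTokens + run.usage.completionTokens,
      totalTokens: acc.totalTokens + run.usage.totalTokens,
    }),
    { promptTokens: 0, completionTokens: 0, totalTokens: 0 },
  );
}

// Hypothetical helper: treat the text of the last run as the final answer.
function finalText(runs: RunResult[]): string {
  const last = runs[runs.length - 1];
  return last ? last.text : '';
}

// Usage figures copied from the example; thinking text shortened.
const runs: RunResult[] = [
  {
    thinkingText: '...',
    text: '',
    tools: [
      {
        id: 'toolcall_3',
        name: 'get_weather',
        args: { location: 'London' },
        output: { content: [{ type: 'text', text: 'Sunny, 20 degrees Celsius.' }] },
        durationMs: 0.135,
      },
    ],
    usage: { promptTokens: 123, completionTokens: 156, totalTokens: 279 },
  },
  {
    thinkingText: '...',
    text: 'The current weather in London is sunny with a temperature of 20 degrees.',
    tools: [],
    usage: { promptTokens: 167, completionTokens: 152, totalTokens: 319 },
  },
];

const total = sumUsage(runs);
// → { promptTokens: 290, completionTokens: 308, totalTokens: 598 }
```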

## Reference implementation

I've been exploring a very similar shape while building an [AgentSDK for Transformers.js](https://github.com/huggingface/transformers.js/tree/feat/agent-sdk-prompt-api), which tries to align closely with the Prompt API. A minimal working example with Gemma 4 running in-browser via WebGPU looks like this:

```typescript
const model = await Model.load(
  { modelId: 'onnx-community/gemma-4-E2B-it-ONNX', device: 'webgpu', dtype: 'q4f16' },
  (info) => (info.status === 'progress_total') && console.log(info),
);

const agent = new Agent({
  model,
  initialPrompts: [{ role: 'system', content: 'You are a helpful research assistant.' }],
});

const result = await agent.run('Who are you?');
```

This isn't meant as a concrete implementation proposal. It's a working reference point showing that the two APIs are converging on similar problems, and that aligning on message shape early would benefit both.

---

Looking forward to hearing thoughts, especially on whether the per-role interface split feels right, and whether the multi-run response model is a direction worth pursuing.
