This issue aims to bring together several related discussions (including #138, #149, and #50) into one cohesive proposal around restructuring the Message type to better reflect the realities of modern on-device LLMs.
Motivation
Even sub-10B models increasingly support multimodal inputs, tool calling, and extended thinking. The current Message shape (a role and a string content) doesn't have room for any of these, which makes it hard to build real agent-style applications on top of the Prompt API. Rather than solving each capability in isolation, it seems worth discussing a more unified message structure that could accommodate all of them cleanly.
Proposed message structure
The key idea is to make content a union of typed content parts (text, image, audio), and to introduce distinct message interfaces per role, since the properties that make sense differ significantly between user, assistant, system, and tool messages:
export type MessageContent = string | MessageContentPart[];

export type MessageContentPart = TextContentPart | ImageContentPart | AudioContentPart;

export interface TextContentPart {
  type: 'text';
  text: string;
}

export interface ImageContentPart {
  type: 'image';
  data: string | Blob | ArrayBuffer | Uint8Array;
  mimeType?: 'image/png' | 'image/jpeg' | 'image/webp' | string;
}

export interface AudioContentPart {
  type: 'audio';
  data: string | Blob | ArrayBuffer | Uint8Array;
  mimeType?: 'audio/wav' | 'audio/mpeg' | 'audio/ogg' | string;
}

export interface SystemMessage {
  role: 'system';
  content: MessageContent;
}

export interface UserMessage {
  role: 'user';
  content: MessageContent;
}

export interface AssistantMessage {
  role: 'assistant';
  content?: MessageContent;
  thinking?: string;
  toolCalls?: Array<{
    id: string;
    type?: 'function';
    function: {
      name: string;
      arguments: Record<string, unknown>;
    };
  }>;
}

export interface ToolMessage {
  role: 'tool';
  toolCallId: string;
  name?: string;
  content: MessageContent;
}

export type Message = SystemMessage | UserMessage | AssistantMessage | ToolMessage;
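To illustrate why the per-role split is useful, here is a minimal sketch of how a consumer could narrow the `Message` union by `role`. The types are restated in a reduced form (string-only `data`) so the block is self-contained, and `contentToText` and `describeMessage` are hypothetical helpers, not part of the proposal.

```typescript
// Reduced restatement of the proposed types (binary `data` variants omitted).
type MessageContentPart =
  | { type: 'text'; text: string }
  | { type: 'image'; data: string; mimeType?: string }
  | { type: 'audio'; data: string; mimeType?: string };
type MessageContent = string | MessageContentPart[];

interface ToolCall {
  id: string;
  type?: 'function';
  function: { name: string; arguments: Record<string, unknown> };
}

type Message =
  | { role: 'system'; content: MessageContent }
  | { role: 'user'; content: MessageContent }
  | { role: 'assistant'; content?: MessageContent; thinking?: string; toolCalls?: ToolCall[] }
  | { role: 'tool'; toolCallId: string; name?: string; content: MessageContent };

// Hypothetical helper: flatten content to text, using placeholders for media parts.
function contentToText(content: MessageContent | undefined): string {
  if (content === undefined) return '';
  if (typeof content === 'string') return content;
  return content
    .map((part) => (part.type === 'text' ? part.text : `[${part.type}]`))
    .join(' ');
}

// Hypothetical helper: the switch narrows `message` to the matching role interface,
// so role-specific fields such as `toolCalls` or `toolCallId` are type-checked.
function describeMessage(message: Message): string {
  switch (message.role) {
    case 'system':
    case 'user':
      return `${message.role}: ${contentToText(message.content)}`;
    case 'assistant': {
      const calls = (message.toolCalls ?? []).map((call) => call.function.name);
      const pieces = [
        contentToText(message.content),
        calls.length > 0 ? `calls ${calls.join(', ')}` : '',
      ].filter((piece) => piece !== '');
      return `assistant: ${pieces.join(' | ')}`;
    }
    case 'tool':
      return `tool[${message.toolCallId}]: ${contentToText(message.content)}`;
  }
}

const conversation: Message[] = [
  {
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this picture?' },
      { type: 'image', data: 'aGVsbG8=', mimeType: 'image/png' },
    ],
  },
  {
    role: 'assistant',
    toolCalls: [{ id: 'call_1', function: { name: 'describe_image', arguments: {} } }],
  },
  { role: 'tool', toolCallId: 'call_1', content: 'A cat on a sofa.' },
];

const summaries = conversation.map(describeMessage);
```

With a single flat `Message` interface, fields like `toolCallId` would have to be optional everywhere; with the union, the compiler only allows them on the role where they make sense.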
Note that MessageContent is shared across system, user, assistant, and tool messages, keeping the content shape consistent regardless of role.
Because messages are fully serializable, they can also be stored and reused across sessions via initialPrompts, which would address #50.
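As a rough sketch of the storage round trip, assuming binary data is carried as a base64 string so plain JSON is enough (a `Blob` or `ArrayBuffer` would need to be encoded first, or stored through a structured-clone store). `saveHistory` and `loadHistory` are hypothetical names, and the types are a reduced restatement of the ones above.

```typescript
// Reduced message types: only string-encoded data, which JSON can represent directly.
type StoredPart =
  | { type: 'text'; text: string }
  | { type: 'image'; data: string; mimeType?: string };

interface StoredMessage {
  role: 'system' | 'user';
  content: string | StoredPart[];
}

// Hypothetical helper: serialize a conversation for storage between sessions.
function saveHistory(messages: StoredMessage[]): string {
  return JSON.stringify(messages);
}

// Hypothetical helper: parse stored messages, with a minimal shape check.
function loadHistory(serialized: string): StoredMessage[] {
  const parsed: unknown = JSON.parse(serialized);
  if (!Array.isArray(parsed)) {
    throw new Error('Expected an array of messages');
  }
  return parsed as StoredMessage[];
}

const history: StoredMessage[] = [
  { role: 'system', content: 'You are a helpful research assistant.' },
  {
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image.' },
      { type: 'image', data: 'aGVsbG8=', mimeType: 'image/png' },
    ],
  },
];

// The restored array has the same shape as the original and could be
// handed back as initialPrompts in a later session.
const restored = loadHistory(saveHistory(history));
```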
Proposed response structure
On the output side, a flat text response also doesn't map well to agentic interactions. A single user prompt can trigger multiple model turns: one or more to emit tool calls, others to synthesize the result. A more flexible response type (RequestResult below) could model this as a list of RunResult objects:
export interface Usage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

export interface RunResult {
  thinkingText: string;
  text: string;
  tools: ToolCallResult[];
  usage: Usage;
}

export interface RequestResult {
  done: boolean;
  runs: RunResult[];
  usage: Usage;
}

export type StreamChunk = RequestResult;
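`ToolCallResult` is referenced above but not spelled out. The sketch below fills it in with an assumed shape inferred from the example response in this proposal (`id`, `name`, `args`, `output`, `durationMs`), and adds a hypothetical `sumUsage` helper. It also assumes that request-level usage is simply the sum of the per-run usage, which matches the numbers in the example but is not stated as a rule.

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

// Assumed shape, inferred from the example response; not defined in the proposal.
interface ToolCallResult {
  id: string;
  name: string;
  args: Record<string, unknown>;
  output: { content: Array<{ type: 'text'; text: string }> };
  durationMs: number;
}

interface RunResult {
  thinkingText: string;
  text: string;
  tools: ToolCallResult[];
  usage: Usage;
}

// Hypothetical helper: derive request-level usage by summing each run's usage.
function sumUsage(runs: RunResult[]): Usage {
  return runs.reduce<Usage>(
    (total, run) => ({
      promptTokens: total.promptTokens + run.usage.promptTokens,
      completionTokens: total.completionTokens + run.usage.completionTokens,
      totalTokens: total.totalTokens + run.usage.totalTokens,
    }),
    { promptTokens: 0, completionTokens: 0, totalTokens: 0 },
  );
}

// Per-run usage values taken from the weather example in this proposal.
const exampleRuns: RunResult[] = [
  {
    thinkingText: '',
    text: '',
    tools: [],
    usage: { promptTokens: 123, completionTokens: 156, totalTokens: 279 },
  },
  {
    thinkingText: '',
    text: '',
    tools: [],
    usage: { promptTokens: 167, completionTokens: 152, totalTokens: 319 },
  },
];

const requestUsage = sumUsage(exampleRuns);
// requestUsage is { promptTokens: 290, completionTokens: 308, totalTokens: 598 }
```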
A response to "What's the weather in London?" with tool calling and thinking enabled might look like:
{
  done: true,
  runs: [
    {
      thinkingText:
        '1. **Analyze the user\'s request:** The user is asking ...',
      text: "",
      tools: [
        {
          id: "toolcall_3",
          name: "get_weather",
          args: {
            location: "London",
          },
          output: {
            content: [
              {
                type: "text",
                text: "The weather in London is sunny, 20 degrees Celsius.",
              },
            ],
          },
          durationMs: 0.13499999791383743,
        },
      ],
      usage: {
        promptTokens: 123,
        completionTokens: 156,
        totalTokens: 279,
      },
    },
    {
      thinkingText:
        'The user is asking for the current weather in ...',
      text: "The current weather in London is sunny with a temperature of 20 degrees.",
      tools: [],
      usage: {
        promptTokens: 167,
        completionTokens: 152,
        totalTokens: 319,
      },
    },
  ],
  usage: {
    promptTokens: 290,
    completionTokens: 308,
    totalTokens: 598,
  },
};
Two runs: the first produces the tool call, the second synthesizes the final answer.
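The multi-run flow can be sketched as a small loop: call the model, execute any tool calls it emits, feed the outputs back, and stop once a turn contains no tool calls. Everything here is an assumption made for illustration: `runRequest`, `ModelFn`, `ToolFn`, and the mock model are hypothetical, the types are simplified (string tool output, no usage tracking), and the loop is synchronous to stay short where a real implementation would be asynchronous.

```typescript
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

interface ModelTurn {
  thinkingText: string;
  text: string;
  toolCalls: ToolCall[];
}

interface ToolCallResult extends ToolCall {
  output: string;
}

interface RunResult {
  thinkingText: string;
  text: string;
  tools: ToolCallResult[];
}

interface RequestResult {
  done: boolean;
  runs: RunResult[];
}

type ChatMessage =
  | { role: 'user'; content: string }
  | { role: 'assistant'; content: string; toolCalls: ToolCall[] }
  | { role: 'tool'; toolCallId: string; content: string };

type ModelFn = (messages: ChatMessage[]) => ModelTurn;
type ToolFn = (args: Record<string, unknown>) => string;

// Hypothetical driver: one RunResult per model turn, capped at maxRuns turns.
function runRequest(
  prompt: string,
  model: ModelFn,
  tools: Map<string, ToolFn>,
  maxRuns = 5,
): RequestResult {
  const messages: ChatMessage[] = [{ role: 'user', content: prompt }];
  const runs: RunResult[] = [];
  for (let i = 0; i < maxRuns; i++) {
    const turn = model(messages);
    messages.push({ role: 'assistant', content: turn.text, toolCalls: turn.toolCalls });
    // Execute each requested tool and append its output as a tool message.
    const results: ToolCallResult[] = turn.toolCalls.map((call) => {
      const tool = tools.get(call.name);
      const output = tool !== undefined ? tool(call.args) : `Unknown tool: ${call.name}`;
      messages.push({ role: 'tool', toolCallId: call.id, content: output });
      return { ...call, output };
    });
    runs.push({ thinkingText: turn.thinkingText, text: turn.text, tools: results });
    // A turn without tool calls is treated as the final answer.
    if (turn.toolCalls.length === 0) return { done: true, runs };
  }
  return { done: false, runs };
}

// Mock model: requests the weather tool first, then answers from the tool output.
const mockModel: ModelFn = (messages) => {
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === 'tool') {
    return { thinkingText: '', text: `Answer: ${last.content}`, toolCalls: [] };
  }
  return {
    thinkingText: '',
    text: '',
    toolCalls: [{ id: 'toolcall_1', name: 'get_weather', args: { location: 'London' } }],
  };
};

const weatherTools = new Map<string, ToolFn>([
  ['get_weather', (args) => `Sunny in ${String(args.location)}, 20 degrees Celsius.`],
]);

const result = runRequest("What's the weather in London?", mockModel, weatherTools);
```

With the mock model, `result.runs` has the same two-run shape as the example: a first run carrying the tool call and its output, and a second run carrying the final text.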
Reference implementation
I've been exploring a very similar shape while building an AgentSDK for Transformers.js, which tries to align closely with the Prompt API. A minimal working example with Gemma 4 running in-browser via WebGPU looks like this:
const model = await Model.load(
  { modelId: 'onnx-community/gemma-4-E2B-it-ONNX', device: 'webgpu', dtype: 'q4f16' },
  (info) => (info.status === 'progress_total') && console.log(info),
);

const agent = new Agent({
  model,
  initialPrompts: [{ role: 'system', content: 'You are a helpful research assistant.' }],
});

const result = await agent.run('Who are you?');
This isn't meant as a concrete implementation proposal. It is more a working reference point showing that the two APIs are converging on similar problems, and that aligning on message shape early would benefit both.
Looking forward to hearing thoughts, especially on whether the per-role interface split feels right, and whether the multi-run response model is a direction worth pursuing.