
Extend the Message format to support tool calling, extended thinking, and multimodality #209

Description

@nico-martin

This issue aims to bring together several related discussions (including #138, #149, and #50) into one cohesive proposal around restructuring the Message type to better reflect the realities of modern on-device LLMs.

Motivation

Even sub-10B models increasingly support multimodal inputs, tool calling, and extended thinking. The current Message shape (a role and a string content) doesn't have room for any of these, which makes it hard to build real agent-style applications on top of the Prompt API. Rather than solving each capability in isolation, it seems worth discussing a more unified message structure that could accommodate all of them cleanly.

Proposed message structure

The key idea is to make content a union of typed content parts (text, image, audio), and to introduce distinct message interfaces per role, since the properties that make sense differ significantly between user, assistant, system, and tool messages:

export type MessageContent = string | MessageContentPart[];

export type MessageContentPart = TextContentPart | ImageContentPart | AudioContentPart;

export interface TextContentPart {
    type: 'text';
    text: string;
}

export interface ImageContentPart {
    type: 'image';
    data: string | Blob | ArrayBuffer | Uint8Array;
    mimeType?: 'image/png' | 'image/jpeg' | 'image/webp' | string;
}

export interface AudioContentPart {
    type: 'audio';
    data: string | Blob | ArrayBuffer | Uint8Array;
    mimeType?: 'audio/wav' | 'audio/mpeg' | 'audio/ogg' | string;
}

export interface SystemMessage {
    role: 'system';
    content: MessageContent;
}

export interface UserMessage {
    role: 'user';
    content: MessageContent;
}

export interface AssistantMessage {
    role: 'assistant';
    content?: MessageContent;
    thinking?: string;
    toolCalls?: Array<{
        id: string;
        type?: 'function';
        function: {
            name: string;
            arguments: Record<string, unknown>;
        };
    }>;
}

export interface ToolMessage {
    role: 'tool';
    toolCallId: string;
    name?: string;
    content: MessageContent;
}

export type Message = SystemMessage | UserMessage | AssistantMessage | ToolMessage;

Note that MessageContent is shared across user, system, and assistant messages as well as tool call responses, keeping the content shape consistent regardless of role.
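To illustrate how the per-role split could be consumed, here is a minimal sketch. It uses condensed copies of the proposed types (text and image parts only), and the helpers `contentToText` and `describe` are hypothetical names introduced purely for illustration. The point is that `role` acts as a discriminant, so a `switch` narrows each message to its role-specific fields, while one helper handles the shared content shape.

```typescript
// Condensed copies of the proposed types (audio and binary data omitted for brevity).
type TextContentPart = { type: 'text'; text: string };
type ImageContentPart = { type: 'image'; data: string; mimeType?: string };
type MessageContentPart = TextContentPart | ImageContentPart;
type MessageContent = string | MessageContentPart[];

type ToolCall = {
  id: string;
  type?: 'function';
  function: { name: string; arguments: Record<string, unknown> };
};

type Message =
  | { role: 'system'; content: MessageContent }
  | { role: 'user'; content: MessageContent }
  | { role: 'assistant'; content?: MessageContent; thinking?: string; toolCalls?: ToolCall[] }
  | { role: 'tool'; toolCallId: string; name?: string; content: MessageContent };

// Hypothetical helper: flatten any MessageContent to plain text, skipping non-text parts.
function contentToText(content: MessageContent | undefined): string {
  if (content === undefined) return '';
  if (typeof content === 'string') return content;
  return content
    .filter((part): part is TextContentPart => part.type === 'text')
    .map((part) => part.text)
    .join('');
}

// Hypothetical helper: `role` narrows the union, so toolCalls and toolCallId
// are only accessible on the message kinds that declare them.
function describe(message: Message): string {
  switch (message.role) {
    case 'assistant': {
      const calls = (message.toolCalls ?? []).map((call) => call.function.name);
      return calls.length > 0
        ? `assistant calls ${calls.join(', ')}`
        : `assistant: ${contentToText(message.content)}`;
    }
    case 'tool':
      return `tool result for ${message.toolCallId}: ${contentToText(message.content)}`;
    default:
      return `${message.role}: ${contentToText(message.content)}`;
  }
}

const conversation: Message[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: [{ type: 'text', text: "What's the weather in London?" }] },
  {
    role: 'assistant',
    toolCalls: [{ id: 'toolcall_3', function: { name: 'get_weather', arguments: { location: 'London' } } }],
  },
  { role: 'tool', toolCallId: 'toolcall_3', content: [{ type: 'text', text: 'Sunny, 20 degrees Celsius.' }] },
];

const summary = conversation.map(describe);
```

Because `content` is optional only on the assistant message, an assistant turn that consists solely of tool calls type-checks without a dummy empty string.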

Because messages are fully serializable, they can also be stored and reused across sessions via initialPrompts, which would address #50.
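As a rough sketch of what that reuse could look like, the snippet below round-trips a short history through JSON. `StoredMessage` is a simplified, hypothetical stand-in for the proposed `Message` type with string-only content. One caveat worth noting: this only holds when `data` is a string, since `Uint8Array`, `ArrayBuffer`, and `Blob` values do not survive `JSON.stringify`, so image and audio parts would presumably need to be encoded (for example as base64 or a data URL) before being stored.

```typescript
// Simplified, hypothetical stand-in for the proposed Message type (string content only).
type StoredMessage =
  | { role: 'system' | 'user'; content: string }
  | {
      role: 'assistant';
      content?: string;
      thinking?: string;
      toolCalls?: Array<{ id: string; function: { name: string; arguments: Record<string, unknown> } }>;
    }
  | { role: 'tool'; toolCallId: string; content: string };

const history: StoredMessage[] = [
  { role: 'user', content: "What's the weather in London?" },
  {
    role: 'assistant',
    toolCalls: [{ id: 'toolcall_3', function: { name: 'get_weather', arguments: { location: 'London' } } }],
  },
  { role: 'tool', toolCallId: 'toolcall_3', content: 'Sunny, 20 degrees Celsius.' },
];

// Persist the history as a string (the storage backend is left open here).
const serialized = JSON.stringify(history);

// Later, possibly in another session: restore and pass along as initial prompts.
const restored = JSON.parse(serialized) as StoredMessage[];
const roundTripIsLossless = JSON.stringify(restored) === serialized;

// Binary data is the exception: a typed array comes back as a plain object.
const binaryRoundTrip = JSON.parse(JSON.stringify({ data: new Uint8Array([1, 2]) })) as { data: unknown };
const binarySurvives = binaryRoundTrip.data instanceof Uint8Array;
```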

Proposed response structure

On the output side, a flat text response also doesn't map well to agentic interactions. A single user prompt can trigger multiple model turns: one or more to emit tool calls, others to synthesize the result. A more flexible RequestResult could model this as a list of RunResults:

export interface Usage {
    promptTokens: number;
    completionTokens: number;
    totalTokens: number;
}

export interface RunResult {
    thinkingText: string;
    text: string;
    tools: ToolCallResult[];
    usage: Usage;
}

export interface RequestResult {
    done: boolean;
    runs: RunResult[];
    usage: Usage;
}

export type StreamChunk = RequestResult;
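`ToolCallResult` is referenced above but not spelled out. The sketch below fills it in with an assumed shape inferred from the weather example in this proposal (`id`, `name`, `args`, `output`, `durationMs`), so treat it as a guess rather than a definition. The helpers `sumUsage` and `finalText` are hypothetical names showing how a consumer might derive the request-level usage and the final answer from the runs.

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

// Assumed shape, inferred from the example response; not defined in the proposal.
interface ToolCallResult {
  id: string;
  name: string;
  args: Record<string, unknown>;
  output: { content: Array<{ type: 'text'; text: string }> };
  durationMs: number;
}

interface RunResult {
  thinkingText: string;
  text: string;
  tools: ToolCallResult[];
  usage: Usage;
}

interface RequestResult {
  done: boolean;
  runs: RunResult[];
  usage: Usage;
}

// Hypothetical helper: request-level usage as the field-wise sum over all runs.
function sumUsage(runs: RunResult[]): Usage {
  return runs.reduce<Usage>(
    (total, run) => ({
      promptTokens: total.promptTokens + run.usage.promptTokens,
      completionTokens: total.completionTokens + run.usage.completionTokens,
      totalTokens: total.totalTokens + run.usage.totalTokens,
    }),
    { promptTokens: 0, completionTokens: 0, totalTokens: 0 },
  );
}

// Hypothetical helper: the user-facing answer is the text of the last run, if any.
function finalText(result: RequestResult): string {
  const lastRun = result.runs[result.runs.length - 1];
  return lastRun ? lastRun.text : '';
}

const sampleRuns: RunResult[] = [
  {
    thinkingText: '',
    text: '',
    tools: [
      {
        id: 'toolcall_3',
        name: 'get_weather',
        args: { location: 'London' },
        output: { content: [{ type: 'text', text: 'Sunny, 20 degrees Celsius.' }] },
        durationMs: 0.13,
      },
    ],
    usage: { promptTokens: 123, completionTokens: 156, totalTokens: 279 },
  },
  {
    thinkingText: '',
    text: 'The current weather in London is sunny with a temperature of 20 degrees.',
    tools: [],
    usage: { promptTokens: 167, completionTokens: 152, totalTokens: 319 },
  },
];

const sampleResult: RequestResult = { done: true, runs: sampleRuns, usage: sumUsage(sampleRuns) };
const answer = finalText(sampleResult);
```

Since `StreamChunk` is an alias for `RequestResult`, the same helpers would apply to partial results while `done` is still `false`.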

A response to "What's the weather in London?" with tool calling and thinking enabled might look like:

{
  done: true,
  runs: [
    {
      thinkingText:
        '1.  **Analyze the user\'s request:** The user is asking ...',
      text: "",
      tools: [
        {
          id: "toolcall_3",
          name: "get_weather",
          args: {
            location: "London",
          },
          output: {
            content: [
              {
                type: "text",
                text: "The weather in London is sunny, 20 degrees Celsius.",
              },
            ],
          },
          durationMs: 0.13499999791383743,
        },
      ],
      usage: {
        promptTokens: 123,
        completionTokens: 156,
        totalTokens: 279,
      },
    },
    {
      thinkingText:
        'The user is asking for the current weather in ...',
      text: "The current weather in London is sunny with a temperature of 20 degrees.",
      tools: [],
      usage: {
        promptTokens: 167,
        completionTokens: 152,
        totalTokens: 319,
      },
    },
  ],
  usage: {
    promptTokens: 290,
    completionTokens: 308,
    totalTokens: 598,
  },
};

Two runs: the first produces the tool call, the second synthesizes the final answer.
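To make the run boundary concrete, here is a minimal sketch of a loop that would produce this two-run shape. Everything in it is an assumption for illustration: `stubModel` stands in for a real model, `tools` is a hypothetical registry, and the types are simplified versions of the proposed ones. Each loop iteration is one run, and the loop ends at the first run that emits no tool calls.

```typescript
type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type ModelTurn = { text: string; toolCalls: ToolCall[] };
type HistoryEntry = { role: 'user' | 'assistant' | 'tool'; content: string };
type Run = { text: string; tools: Array<ToolCall & { output: string }> };

// Stub standing in for a real model: it asks for the weather tool once,
// then answers as soon as a tool result is present in the history.
function stubModel(history: HistoryEntry[]): ModelTurn {
  const hasToolResult = history.some((entry) => entry.role === 'tool');
  return hasToolResult
    ? { text: 'The current weather in London is sunny with a temperature of 20 degrees.', toolCalls: [] }
    : { text: '', toolCalls: [{ id: 'toolcall_3', name: 'get_weather', args: { location: 'London' } }] };
}

// Hypothetical tool registry keyed by tool name.
const tools: Record<string, (args: Record<string, unknown>) => string> = {
  get_weather: (args) => `The weather in ${String(args['location'])} is sunny, 20 degrees Celsius.`,
};

function runRequest(prompt: string, maxRuns = 4): Run[] {
  const history: HistoryEntry[] = [{ role: 'user', content: prompt }];
  const runs: Run[] = [];
  for (let i = 0; i < maxRuns; i++) {
    const turn = stubModel(history);
    // Execute every tool call the model emitted in this run.
    const executed = turn.toolCalls.map((call) => {
      const tool = tools[call.name];
      const output = tool ? tool(call.args) : `Unknown tool: ${call.name}`;
      return { ...call, output };
    });
    runs.push({ text: turn.text, tools: executed });
    if (executed.length === 0) break; // No tool calls: this run is the final answer.
    // Feed the assistant turn and the tool outputs back for the next run.
    history.push({ role: 'assistant', content: turn.text });
    for (const result of executed) history.push({ role: 'tool', content: result.output });
  }
  return runs;
}

const runs = runRequest("What's the weather in London?");
```

The `maxRuns` cap is a design choice in this sketch to keep a misbehaving model from looping on tool calls indefinitely.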

Reference implementation

I've been exploring a very similar shape while building an AgentSDK for Transformers.js, which tries to align closely with the Prompt API. A minimal working example with Gemma 4 running in-browser via WebGPU looks like this:

const model = await Model.load(
  { modelId: 'onnx-community/gemma-4-E2B-it-ONNX', device: 'webgpu', dtype: 'q4f16' },
  (info) => (info.status === 'progress_total') && console.log(info),
);

const agent = new Agent({
  model,
  initialPrompts: [{ role: 'system', content: 'You are a helpful research assistant.' }],
});

const result = await agent.run('Who are you?');

This isn't meant as a concrete implementation proposal. It is more of a working reference point showing that the two APIs are converging on similar problems, and that aligning on the message shape early would benefit both.


Looking forward to hearing thoughts, especially on whether the per-role interface split feels right, and whether the multi-run response model is a direction worth pursuing.
