eventType conversation.perception_tool_call, a modality in data.properties ("vision" or "audio"), the tool name, and arguments.
Modality-specific payload:
modality: "audio". Triggered by audio tools (audio_tool_prompt / audio_tools). arguments is a JSON string (e.g. "{\"reason\":\"The user said …\"}"). There is no frames array.
modality: "vision". Triggered by visual tools (visual_tool_prompt / visual_tools). arguments is an object with tool-defined fields. Includes a frames array of objects with data (base64-encoded JPEG) and mime_type (e.g. "image/jpeg") for the images that triggered the call.
The event also carries a seq field for global ordering and a turn_idx field that identifies which conversational turn the perception tool call belongs to. See Event Ordering and Turn Tracking for details.
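Because the two modalities deliver arguments in different shapes, a consumer will usually want to normalize them before dispatching to a tool. The following is a minimal sketch, not an official client: the function name handle_perception_tool_call is hypothetical, and it assumes the event has already been parsed into a dict whose tool call payload sits under a "properties" key.

```python
import base64
import json


def handle_perception_tool_call(event: dict) -> dict:
    """Normalize a perception tool call into name, arguments, and decoded frames."""
    props = event["properties"]  # assumed key for the tool call payload
    modality = props["modality"]
    args = props["arguments"]

    if modality == "audio":
        # Audio calls deliver arguments as a JSON string and carry no frames.
        arguments = json.loads(args) if isinstance(args, str) else args
        frames = []
    elif modality == "vision":
        # Vision calls deliver arguments as an object plus base64-encoded frames.
        arguments = args
        frames = [
            (frame["mime_type"], base64.b64decode(frame["data"]))
            for frame in props.get("frames", [])
        ]
    else:
        raise ValueError(f"unexpected modality: {modality!r}")

    return {"name": props["name"], "arguments": arguments, "frames": frames}
```

The returned frames are (mime_type, bytes) pairs, so each image can be written to disk or passed to an image library without further decoding.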
frames with base64-encoded images. The data values in the example are shortened for readability.
The message type indicates which product this event is used for. In this case, message_type is conversation.
"conversation"
This is the type of event that is being sent back. This field will be present on all events and can be used to distinguish between different event types.
"conversation.perception_tool_call"
A globally monotonic sequence number assigned to each event. Use this to determine the ordering of events — a higher seq means the event was sent later. This is useful for reconciling events that may arrive out of order.
42
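As a rough illustration of using seq to reconcile out-of-order delivery, the sketch below sorts a batch of already-parsed events by their sequence number. The helper name in_send_order is hypothetical, and the sketch assumes each event is a dict with an integer "seq" key.

```python
def in_send_order(events: list[dict]) -> list[dict]:
    """Return events sorted by seq, so that later-sent events come last."""
    # seq only promises ordering; this sketch does not assume the values
    # are contiguous, so it makes no attempt to detect missing events.
    return sorted(events, key=lambda event: event["seq"])
```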
The unique identifier for the conversation.
"c123456"
The conversation turn index. This value increments each time a conversation.respond interaction is received, and groups all events that belong to the same conversational turn. Use this to correlate events (utterances, tool calls, speaking state changes, etc.) that are part of the same turn.
3
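One way to correlate events from the same turn is to bucket them by turn_idx, keeping each bucket in seq order. This is a sketch under the assumption that events are parsed dicts with "turn_idx" and "seq" keys; group_by_turn is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_turn(events: list[dict]) -> dict[int, list[dict]]:
    """Group events by turn_idx, with each group ordered by seq."""
    turns: dict[int, list[dict]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e["seq"]):
        turns[event["turn_idx"]].append(event)
    return dict(turns)
```

With this grouping, a perception tool call can be looked up alongside the utterances and other events that share its turn_idx.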
Contains the tool call payload. Includes modality (vision or audio), name, arguments, and for vision calls, frames.