Perception Tool Call Event

Tavus broadcasts a perception tool call event when Raven triggers a perception tool based on visual or audio input. The event always includes an eventType of conversation.perception_tool_call and, in data.properties, a modality ("vision" or "audio"), the tool name, and the tool arguments. The rest of the payload depends on the modality:

modality: "audio" — Triggered by audio tools (audio_tool_prompt / audio_tools). arguments is a JSON string (e.g. "{\"reason\":\"The user said …\"}"). There is no frames array.
modality: "vision" — Triggered by visual tools (visual_tool_prompt / visual_tools). arguments is an object with tool-defined fields. Includes a frames array of objects with data (base64-encoded JPEG) and mime_type (e.g. "image/jpeg") for the images that triggered the call.
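
Because arguments arrives as a JSON string for audio calls and as an object for vision calls, a consumer may want to normalize it before use. The following is a minimal sketch, not an official client: the helper name parse_tool_arguments is hypothetical, and it assumes the event has already been decoded from JSON into a Python dict.

```python
import json
from typing import Any


def parse_tool_arguments(properties: dict[str, Any]) -> dict[str, Any]:
    """Return the tool arguments as a dict for either modality."""
    arguments = properties.get("arguments")
    if isinstance(arguments, str):
        # Audio calls carry arguments as a JSON-encoded string.
        return json.loads(arguments)
    if isinstance(arguments, dict):
        # Vision calls carry arguments as an object already.
        return arguments
    # Missing or unexpected shape: fall back to no arguments.
    return {}


# Sample properties modeled on the audio and vision payloads described above.
audio_props = {
    "arguments": "{\"reason\":\"The user said \\\"well, yeah\\\"\"}",
    "modality": "audio",
    "name": "notify_sarcasm_detected",
}
vision_props = {
    "arguments": {"hat_type": "baseball cap"},
    "modality": "vision",
    "name": "notify_hat_detected",
}

print(parse_tool_arguments(audio_props))
print(parse_tool_arguments(vision_props))
```

Checking the runtime type rather than the modality keeps the helper tolerant if a payload shape differs from what is described here.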

Perception tool calls can be used to trigger automated actions in response to visual or audio cues detected by the Raven perception system. For more on configuring perception tool calls, see Tool Calling for Perception and Perception. This event includes a seq field for global ordering and a turn_idx field to identify which conversational turn the perception tool call belongs to. See Event Ordering and Turn Tracking for details.
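
One way to connect tool calls to automated actions is a small registry keyed by tool name. This is a sketch under assumptions: on_tool, dispatch, and handle_hat are hypothetical names, the event is assumed to be already parsed into a dict, and how events reach your application (webhook, app message listener, etc.) is left out.

```python
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], None]

# Maps a tool name to the function that should run when it is called.
HANDLERS: dict[str, Handler] = {}


def on_tool(name: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one tool name."""
    def register(handler: Handler) -> Handler:
        HANDLERS[name] = handler
        return handler
    return register


actions_taken: list[str] = []


@on_tool("notify_hat_detected")
def handle_hat(properties: dict[str, Any]) -> None:
    # Hypothetical action: record what was detected.
    actions_taken.append(f"hat:{properties['arguments']['hat_type']}")


def dispatch(event: dict[str, Any]) -> bool:
    """Run the registered handler for a perception tool call, if any."""
    data = event.get("data", {})
    if data.get("event_type") != "conversation.perception_tool_call":
        return False
    properties = data.get("properties", {})
    handler = HANDLERS.get(properties.get("name"))
    if handler is None:
        return False
    handler(properties)
    return True


sample_event = {
    "eventType": "conversation.perception_tool_call",
    "data": {
        "event_type": "conversation.perception_tool_call",
        "properties": {
            "arguments": {"hat_type": "baseball cap"},
            "modality": "vision",
            "name": "notify_hat_detected",
        },
    },
}
dispatch(sample_event)
```

Returning a boolean lets the caller log or ignore tool calls that have no registered handler.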

Example: audio tool call

When an audio tool is triggered (e.g. sarcasm detection), the event looks like:

{
  "timestamp": "2026-03-02T21:51:47.194Z",
  "eventType": "conversation.perception_tool_call",
  "data": {
    "conversation_id": "c58b46f8646d943f",
    "event_type": "conversation.perception_tool_call",
    "message_type": "conversation",
    "seq": 17,
    "turn_idx": 2,
    "properties": {
      "arguments": "{\"reason\":\"The user said \\\"well, yeah\\\"\"}",
      "modality": "audio",
      "name": "notify_sarcasm_detected"
    }
  }
}

Example: vision tool call

When a visual tool is triggered (e.g. hat detection), the event includes frames with base64-encoded images. The data values in the example are shortened for readability.

{
  "timestamp": "2026-03-02T21:51:49.730Z",
  "eventType": "conversation.perception_tool_call",
  "data": {
    "conversation_id": "c58b46f8646d943f",
    "event_type": "conversation.perception_tool_call",
    "message_type": "conversation",
    "seq": 18,
    "turn_idx": 2,
    "properties": {
      "arguments": {
        "hat_type": "baseball cap"
      },
      "frames": [
        { "data": "<base64-encoded-jpeg>", "mime_type": "image/jpeg" },
        { "data": "<base64-encoded-jpeg>", "mime_type": "image/jpeg" }
      ],
      "modality": "vision",
      "name": "notify_hat_detected"
    }
  }
}
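
To work with the images that triggered a vision call, the frames can be decoded from base64 into raw bytes. The sketch below assumes the event is already parsed into a dict; decode_frames is a hypothetical helper, and the sample payload is a three-byte placeholder, not a real image.

```python
import base64
from typing import Any


def decode_frames(event: dict[str, Any]) -> list[tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for each frame of a vision tool call."""
    properties = event.get("data", {}).get("properties", {})
    if properties.get("modality") != "vision":
        # Audio tool calls have no frames array.
        return []
    return [
        (frame["mime_type"], base64.b64decode(frame["data"]))
        for frame in properties.get("frames", [])
    ]


# Placeholder bytes standing in for a JPEG; real frames are much larger.
placeholder = base64.b64encode(b"\xff\xd8\xff").decode("ascii")
vision_event = {
    "data": {
        "properties": {
            "arguments": {"hat_type": "baseball cap"},
            "frames": [{"data": placeholder, "mime_type": "image/jpeg"}],
            "modality": "vision",
            "name": "notify_hat_detected",
        }
    }
}

for mime_type, raw in decode_frames(vision_event):
    print(mime_type, len(raw))
```

The decoded bytes can then be written to disk or handed to an image library of your choice.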

message_type

string

Indicates which product this event is used for. For this event, the message_type is conversation.

Example:

"conversation"

event_type

string

The type of event being sent. This field is present on all events and can be used to distinguish between event types.

Example:

"conversation.perception_tool_call"

seq

integer

A globally monotonic sequence number assigned to each event. Use this to determine the ordering of events — a higher seq means the event was sent later. This is useful for reconciling events that may arrive out of order.

Example:

42
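
As an illustration of reordering by seq, the sketch below sorts a batch of already-parsed events that arrived out of order. The helper name order_by_seq is hypothetical, and seq is assumed to sit under data as in the example payloads.

```python
from typing import Any


def order_by_seq(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the events sorted by their global sequence number."""
    return sorted(events, key=lambda event: event["data"]["seq"])


# Events listed in arrival order, which here differs from send order.
received = [
    {"data": {"seq": 18, "event_type": "conversation.perception_tool_call"}},
    {"data": {"seq": 17, "event_type": "conversation.perception_tool_call"}},
]

ordered = order_by_seq(received)
print([event["data"]["seq"] for event in ordered])  # prints [17, 18]
```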

conversation_id

string

The unique identifier for the conversation.

Example:

"c123456"

turn_idx

integer

The conversation turn index. This value increments each time a conversation.respond interaction is received, and groups all events that belong to the same conversational turn. Use this to correlate events (utterances, tool calls, speaking state changes, etc.) that are part of the same turn.

Example:

3
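
To correlate events from the same turn, they can be bucketed by turn_idx and kept in seq order within each bucket. This is a minimal sketch with a hypothetical helper name (group_by_turn); it assumes both fields sit under data, as in the example payloads.

```python
from collections import defaultdict
from typing import Any


def group_by_turn(events: list[dict[str, Any]]) -> dict[int, list[dict[str, Any]]]:
    """Group events by turn_idx, ordered by seq within each turn."""
    turns: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e["data"]["seq"]):
        turns[event["data"]["turn_idx"]].append(event)
    return dict(turns)


events = [
    {"data": {"seq": 18, "turn_idx": 2, "properties": {"name": "notify_hat_detected"}}},
    {"data": {"seq": 21, "turn_idx": 3, "properties": {"name": "notify_hat_detected"}}},
    {"data": {"seq": 17, "turn_idx": 2, "properties": {"name": "notify_sarcasm_detected"}}},
]

turns = group_by_turn(events)
for turn_idx, turn_events in turns.items():
    names = [e["data"]["properties"]["name"] for e in turn_events]
    print(turn_idx, names)
```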

properties

object

Contains the tool call payload. Includes modality (vision or audio), name, arguments, and for vision calls, frames.
