User utterances (role: user) are sent when the user finishes speaking and contain the transcribed text.
Replica utterances (role: replica) are sent immediately when the replica begins speaking and contain the full LLM response text — including words the replica may not have actually spoken if it was interrupted. This makes them useful for quickly displaying the replica’s intended response.
If the replica is interrupted, the conversation.utterance event (role=replica) will still contain the full intended response. To track only the words the replica actually spoke, use streaming utterance events, which progressively report spoken text and indicate interruptions.

The properties object may include user_audio_analysis (tone and delivery) and/or user_visual_analysis (appearance and demeanor). These fields are only present when there is relevant analysis for that utterance.
This event includes a seq field for global ordering and a turn_idx field to identify which conversational turn the utterance belongs to. See Event Ordering and Turn Tracking for details.

The message_type field indicates what product this event is used for. In this case, the message_type will be conversation.
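Putting the fields below together, a conversation.utterance payload might look like the following. The values are illustrative, and the field names for the event type and utterance identifier (event_type, utterance_id) are assumptions for this sketch rather than confirmed names:

```json
{
  "message_type": "conversation",
  "event_type": "conversation.utterance",
  "conversation_id": "c123456",
  "utterance_id": "83294d9f-8306-491b-a284-791f56c8383f",
  "seq": 42,
  "turn_idx": 3,
  "properties": {
    "role": "replica",
    "speech": "Sure, I can help with that."
  }
}
```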
"conversation"
This is the type of event that is being sent back. This field will be present on all events and can be used to distinguish between different event types.
"conversation.utterance"
A globally monotonic sequence number assigned to each event. Use this to determine the ordering of events — a higher seq means the event was sent later. This is useful for reconciling events that may arrive out of order.
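Because events may arrive out of order, a receiver can buffer them and process them in seq order. A minimal sketch, assuming each event is a dict carrying a numeric seq field:

```python
def reorder_events(events):
    """Return events sorted by their global sequence number.

    Events may arrive out of order over the wire; sorting by `seq`
    restores the order in which they were originally sent.
    """
    return sorted(events, key=lambda e: e["seq"])

# Events received out of order are restored to send order.
received = [{"seq": 44}, {"seq": 42}, {"seq": 43}]
ordered = reorder_events(received)
assert [e["seq"] for e in ordered] == [42, 43, 44]
```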
42
The unique identifier for the conversation.
"c123456"
A unique identifier for a given utterance. In this case, it identifies the utterance the replica is speaking.
"83294d9f-8306-491b-a284-791f56c8383f"
The conversation turn index. This value increments each time a conversation.respond interaction is received, and groups all events that belong to the same conversational turn. Use this to correlate events (utterances, tool calls, speaking state changes, etc.) that are part of the same turn.
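To correlate everything that happened within a single turn, incoming events can be bucketed by turn_idx. A minimal sketch (the event_type values here are illustrative, not confirmed event names):

```python
from collections import defaultdict

def group_by_turn(events):
    """Bucket events by the conversational turn they belong to."""
    turns = defaultdict(list)
    for event in events:
        turns[event["turn_idx"]].append(event)
    return dict(turns)

events = [
    {"turn_idx": 3, "event_type": "conversation.utterance"},
    {"turn_idx": 3, "event_type": "conversation.tool_call"},
    {"turn_idx": 4, "event_type": "conversation.utterance"},
]
turns = group_by_turn(events)
assert len(turns[3]) == 2 and len(turns[4]) == 1
```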
3
This object contains the speech property (the contents of the utterance). When the speaker is the user and the persona uses Raven-1, it may also include user_audio_analysis and/or user_visual_analysis when relevant analysis is available.
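A handler can read speech unconditionally but should treat the analysis fields as optional, since they only appear when relevant analysis is available. A minimal sketch, assuming properties is a plain dict shaped as described above:

```python
def summarize_utterance(properties):
    """Extract speech plus any optional analysis from an utterance's properties."""
    summary = {"speech": properties["speech"]}
    # Analysis fields are only present when relevant analysis exists,
    # so fall back to None rather than assuming they are there.
    summary["audio_analysis"] = properties.get("user_audio_analysis")
    summary["visual_analysis"] = properties.get("user_visual_analysis")
    return summary

props = {"speech": "Hello there!", "user_audio_analysis": "calm, measured tone"}
result = summarize_utterance(props)
assert result["speech"] == "Hello there!"
assert result["audio_analysis"] == "calm, measured tone"
assert result["visual_analysis"] is None
```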