Perception with Raven
Raven-0 is a real-time multimodal vision and video understanding system that fundamentally reimagines how AI perceives and interacts with humans. Unlike traditional systems that rely on frame-by-frame analysis, Raven-0 implements a context-aware, human-like perception system modeled on the functioning of the primary visual cortex.
Key Capabilities
Raven-0 provides advanced perception capabilities that go far beyond traditional vision systems:
Real-time Contextual Awareness
Raven-0 continuously analyzes video streams to maintain ambient context, allowing for more natural and responsive interactions.
Multimodal Understanding
Process multiple input channels simultaneously, including webcam feeds and screen sharing, to build a comprehensive understanding of the interaction context.
Human-like Emotion Understanding
Go beyond simple facial expression classification to understand emotions in context, considering situational, environmental, conversational, and temporal factors.
Customizable Perception
Customize the perception layer to fit your specific use case.
Tool Calling Integration
Trigger automated actions based on visual cues through perception tools, enabling integration with your existing systems and workflows.
How Raven Works
Raven-0 implements a dual-track vision processing system that mirrors human perception:
Ambient Perception
Ambient perception acts as the replica’s “eyes,” continuously processing and understanding the visual environment at a low level. This provides ambient context that informs the replica’s responses without requiring explicit queries.
- Default Queries: Raven automatically processes visual information to understand who the user is, what they look like, their emotional state, and other contextual information.
- Custom Queries: You can define custom visual queries that Raven will continuously monitor for, allowing for specialized use cases (see the sketch below).
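For illustration, here is a minimal sketch of how custom queries might be supplied through a persona's perception layer, using the field names described under Configuring Raven below:

```python
# A minimal sketch of custom ambient queries, using the
# perception-layer field names described under "Configuring Raven".
perception_layer = {
    "perception_model": "raven-0",
    "ambient_awareness_queries": [
        "Is the user showing an ID card to the camera?",
        "Does the user appear distracted or disengaged?",
    ],
}
```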
Active Perception
When specific visual information is needed, Raven can perform detailed on-demand analysis:
- Speculative Execution: Raven uses speculative execution to pre-process likely visual queries while the user is speaking, minimizing perceived latency.
Screenshare Vision
Raven-0 processes screen content with higher detail retention, capturing animations, dynamic content, and page changes as they happen. You can share your calendar, documents, and other content with your replica, and switching between screens is handled seamlessly.
Perception Tools
Perception tools are a way to trigger automated actions based on visual cues. Whereas Ambient Perception allows the replica to be aware of visual cues in the context of the conversation, perception tools allow your system to be aware of them. They are configured so that the replica can trigger them in near real time when the relevant visual cue is detected.
When a perception tool is triggered, a Perception Tool Call event is broadcast. This event contains the name of the tool, any arguments passed to it, and the base64-encoded frames that triggered the tool call. You can store the frames for later analysis or use them to trigger other actions (e.g. ID lookup or document processing), as in the sketch below.
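A minimal handler sketch follows. The field names (`tool_name`, `arguments`, `frames`) mirror the event contents described above but are illustrative; match them to the payload your application actually receives.

```python
import base64
from pathlib import Path

def handle_perception_tool_call(event: dict) -> None:
    # Sketch of a Perception Tool Call handler. Field names
    # (tool_name, arguments, frames) are assumptions based on the
    # event contents described above; verify against the schema
    # your application actually receives.
    tool_name = event["tool_name"]
    arguments = event.get("arguments", {})
    print(f"Perception tool triggered: {tool_name} with {arguments}")

    # Persist the base64-encoded frames that triggered the call for
    # later analysis (e.g. ID lookup or document processing).
    for i, frame_b64 in enumerate(event.get("frames", [])):
        Path(f"frame_{i}.jpg").write_bytes(base64.b64decode(frame_b64))
```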
End-of-call Perception Analysis
At the end of a call, Raven-0 summarizes the visual artifacts detected throughout the call. This feature is only available when the persona has raven-0 specified in the perception layer; the summary is broadcast as a Perception Analysis event and delivered separately as a conversation callback.
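The same event-handling pattern applies to the end-of-call summary. A rough sketch, where the `properties` and `analysis` field names are assumptions to check against your event listener or callback payload:

```python
def handle_perception_analysis(event: dict) -> None:
    # Sketch of an end-of-call Perception Analysis handler. The
    # "properties" / "analysis" field names are assumptions; check
    # the payload delivered to your listener or callback URL.
    summary = event.get("properties", {}).get("analysis", "")
    print("End-of-call visual summary:", summary)
```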
Use Cases
Telehealth
- Detect if the user is in a safe, private environment
- Recognize when ID cards or medical documents are shown
- Assess user distress or concerning emotional states
- Process visual information like test results shown via screen sharing
Customer Support
- Identify when products are shown to the camera
- Detect user frustration levels to adjust response tone
- Understand visual problems through screen sharing
Learning & Education
- Provide real-time help on essays and homework shown via screen sharing
- Monitor student engagement and attention
- Ensure academic integrity
Roleplay & Training
- Allow sales professionals to practice pitches with screen sharing
- Facilitate mock interviews with presentation feedback
- Assess energy level, charisma, and presentation skills
Configuring Raven
You can configure Raven-0’s behavior through the Create Persona API by adjusting the perception parameters.
Perception Parameters
- perception_model: The perception model to use. Options include raven-0 for advanced multimodal perception, basic for simpler vision capabilities, and off to disable perception entirely.
- ambient_awareness_queries: Custom queries that Raven-0 will continuously monitor for in the visual stream. These provide ambient context without requiring explicit prompting, letting the replica stay aware of additional visual cues.
- perception_tool_prompt: A prompt that details how and when to use the tools passed to the perception layer. This grounds the replica and helps it understand the context of the perception tools.
- perception_tools: Tools that can be triggered based on visual context, enabling automated actions in response to visual cues from your system.
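Putting the parameters together, here is a minimal sketch of a Create Persona request with a raven-0 perception layer. It assumes the tavusapi.com/v2/personas endpoint and x-api-key header; the detect_id_card tool and its schema are hypothetical examples.

```python
import requests

# Minimal sketch: create a persona with a raven-0 perception layer.
# The detect_id_card tool and its schema are hypothetical examples.
payload = {
    "persona_name": "Telehealth Intake Assistant",
    "system_prompt": "You are a friendly telehealth intake assistant.",
    "layers": {
        "perception": {
            "perception_model": "raven-0",
            "ambient_awareness_queries": [
                "Is the user in a private environment?",
                "Does the user appear distressed?",
            ],
            "perception_tool_prompt": (
                "Call detect_id_card whenever an ID card is clearly "
                "visible in the camera feed."
            ),
            "perception_tools": [
                {
                    "type": "function",
                    "function": {
                        "name": "detect_id_card",
                        "description": "Triggered when an ID card is shown to the camera.",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "card_type": {
                                    "type": "string",
                                    "description": "e.g. driver's license or passport",
                                }
                            },
                            "required": ["card_type"],
                        },
                    },
                }
            ],
        }
    },
}

response = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": "<your-api-key>"},
    json=payload,
)
print(response.json())
```

Keeping the perception_tool_prompt specific, ideally one visual cue per tool, tends to make the resulting tool calls easier to act on downstream.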