Perception with Raven
Raven-0 is a real-time multimodal vision and video understanding system that fundamentally reimagines how AI perceives and interacts with humans. Unlike traditional systems that rely on frame-by-frame analysis, Raven-0 implements a context-aware, human-like perception system modeled on the functioning of the primary visual cortex.
Key Capabilities
Raven-0 provides advanced perception capabilities that go far beyond traditional vision systems:
Real-time Contextual Awareness
Raven-0 continuously analyzes video streams to maintain ambient context, allowing for more natural and responsive interactions.
Multimodal Understanding
Process multiple input channels simultaneously, including webcam feeds and screen sharing, to build a comprehensive understanding of the interaction context.
Human-like Emotion Understanding
Go beyond simple facial expression classification to understand emotions in context, considering situational, environmental, conversational, and temporal factors.
Customizable Perception
Customize the perception layer to fit your specific use case.
Tool Calling Integration
Trigger automated actions based on visual cues through perception tools, enabling integration with your existing systems and workflows.
How Raven Works
Raven-0 implements a dual-track vision processing system that mirrors human perception:
Ambient Perception
Ambient perception acts as the replica’s “eyes,” continuously processing and understanding the visual environment at a low level. This provides ambient context that informs the replica’s responses without requiring explicit queries.
- Default Queries: Raven automatically processes visual information to understand who the user is, what they look like, their emotional state, and other contextual information.
- Custom Queries: You can define custom visual queries that Raven will continuously monitor for, allowing for specialized use cases (see the sketch below).
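For illustration, here is a minimal sketch of how custom queries might be supplied through a persona's perception layer, using the field names described under Configuring Raven below:

```python
# A minimal sketch of custom ambient queries, using the
# perception-layer field names described under "Configuring Raven".
perception_layer = {
    "perception_model": "raven-0",
    "ambient_awareness_queries": [
        "Is the user showing an ID card to the camera?",
        "Does the user appear distracted or disengaged?",
    ],
}
```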
Active Perception
When specific visual information is needed, Raven can perform detailed on-demand analysis:
- Speculative Execution: Raven uses speculative execution to pre-process likely visual queries while the user is speaking, minimizing perceived latency.
Screenshare Vision
Raven-0 processes screen content with higher detail retention, capturing animations, dynamic content, and page changes as they happen. You can share your calendar, documents, and other content with your replica, and switching between screens is handled seamlessly.
Perception Tools
Perception tools are a way to trigger automated actions based on visual cues. Whereas Ambient Perception allows the replica to be aware of visual cues in the context of the conversation, perception tools allow your system to be aware of them. They are configured so that the replica can trigger them in near real time when the relevant visual cue is detected.
When a perception tool is triggered, a Perception Tool Call event is broadcast. This event contains the name of the tool, any arguments passed to it, and the base64-encoded frames that triggered the tool call. You can store the frames for later analysis or use them to trigger other actions (e.g. ID lookup or document processing), as in the sketch below.
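A minimal handler sketch follows. The field names (`tool_name`, `arguments`, `frames`) mirror the event contents described above but are illustrative; match them to the payload your application actually receives.

```python
import base64
from pathlib import Path

def handle_perception_tool_call(event: dict) -> None:
    # Sketch of a Perception Tool Call handler. Field names
    # (tool_name, arguments, frames) are assumptions based on the
    # event contents described above; verify against the schema
    # your application actually receives.
    tool_name = event["tool_name"]
    arguments = event.get("arguments", {})
    print(f"Perception tool triggered: {tool_name} with {arguments}")

    # Persist the base64-encoded frames that triggered the call for
    # later analysis (e.g. ID lookup or document processing).
    for i, frame_b64 in enumerate(event.get("frames", [])):
        Path(f"frame_{i}.jpg").write_bytes(base64.b64decode(frame_b64))
```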
End-of-call Perception Analysis
At the end of a call, Raven-0 summarizes the visual artifacts detected throughout the call. This feature is only available when the persona has raven-0 specified in the perception layer; the summary is broadcast as a Perception Analysis event and delivered separately as a conversation callback.
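The same event-handling pattern applies to the end-of-call summary. A rough sketch, where the `properties` and `analysis` field names are assumptions to check against your event listener or callback payload:

```python
def handle_perception_analysis(event: dict) -> None:
    # Sketch of an end-of-call Perception Analysis handler. The
    # "properties" / "analysis" field names are assumptions; check
    # the payload delivered to your listener or callback URL.
    summary = event.get("properties", {}).get("analysis", "")
    print("End-of-call visual summary:", summary)
```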
Use Cases
Telehealth
- Detect if the user is in a safe, private environment
- Recognize when ID cards or medical documents are shown
- Assess user distress or concerning emotional states
- Process visual information like test results shown via screen sharing
Customer Support
- Identify when products are shown to the camera
- Detect user frustration levels to adjust response tone
- Understand visual problems through screen sharing
Learning & Education
- Provide real-time help on essays and homework shown via screen sharing
- Monitor student engagement and attention
- Ensure academic integrity
Roleplay & Training
- Allow sales professionals to practice pitches with screen sharing
- Facilitate mock interviews with presentation feedback
- Assess energy level, charisma, and presentation skills
Configuring Raven
You can configure Raven-0’s behavior through the Create Persona API by adjusting the perception parameters.
Perception Parameters
- perception_model: The perception model to use. Options include raven-0 for advanced multimodal perception, basic for simpler vision capabilities, and off to disable perception entirely.
- ambient_awareness_queries: Custom queries that Raven-0 will continuously monitor for in the visual stream. These provide ambient context without requiring explicit prompting, letting the replica stay aware of additional visual cues.
- perception_tool_prompt: A prompt that details how and when to use the tools passed to the perception layer. This grounds the replica and helps it understand the context of the perception tools.
- perception_tools: Tools that can be triggered based on visual context, enabling automated actions in response to visual cues from your system.
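Putting the parameters together, here is a minimal sketch of a Create Persona request with a raven-0 perception layer. It assumes the tavusapi.com/v2/personas endpoint and x-api-key header; the detect_id_card tool and its schema are hypothetical examples.

```python
import requests

# Minimal sketch: create a persona with a raven-0 perception layer.
# The detect_id_card tool and its schema are hypothetical examples.
payload = {
    "persona_name": "Telehealth Intake Assistant",
    "system_prompt": "You are a friendly telehealth intake assistant.",
    "layers": {
        "perception": {
            "perception_model": "raven-0",
            "ambient_awareness_queries": [
                "Is the user in a private environment?",
                "Does the user appear distressed?",
            ],
            "perception_tool_prompt": (
                "Call detect_id_card whenever an ID card is clearly "
                "visible in the camera feed."
            ),
            "perception_tools": [
                {
                    "type": "function",
                    "function": {
                        "name": "detect_id_card",
                        "description": "Triggered when an ID card is shown to the camera.",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "card_type": {
                                    "type": "string",
                                    "description": "e.g. driver's license or passport",
                                }
                            },
                            "required": ["card_type"],
                        },
                    },
                }
            ],
        }
    },
}

response = requests.post(
    "https://tavusapi.com/v2/personas",
    headers={"x-api-key": "<your-api-key>"},
    json=payload,
)
print(response.json())
```

Keeping the perception_tool_prompt specific, ideally one visual cue per tool, tends to make the resulting tool calls easier to act on downstream.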