Layers and Modes Overview
CVI provides an end-to-end pipeline that takes in a user audio & video input and outputs a realtime replica AV output. This pipeline is hyper optimized, with layers tightly coupled to achieve the lowest latency in the market. CVI is highly customizable though, with the ability to customize or disable layers as well as different modes being offered to best fit your use case.
Layers
Tavus provides the following customizable layers as part of the CVI pipeline:
Transport
Transport
- Video conferencing / end-to-end WebRTC, currently powered by Daily. It handles audio/visual input and output for CVI.
- We allow configurability for input and output, each with either audio/mic or visual/camera property. You can never disable the Transport layer.
Perception
Perception
User input video / screenshare can be processed using Raven-0, our advanced multimodal perception system, allowing the replica to see and respond to user expressions, environments and the content on screen. See more details for Raven.
Speech Recognition with Sparrow-0 (Semantic Interrupts)
Speech Recognition with Sparrow-0 (Semantic Interrupts)
An optimized ASR system powered by Sparrow-0, enabling incredibly fast and intelligent interrupts with real-time lexical and semantic analysis for precise, natural turn-taking.
LLM
LLM
Tavus provides ultra-low latency optimized LLMs or allows you to bring your own.
TTS
TTS
Tavus provides the TTS audio using a low-latency optimized voice model (powered by Cartesia), or allows you to use one of the other supported voice providers.
Realtime Replica
Realtime Replica
Tavus provides high-quality streaming replicas powered by our proprietary class of models: Phoenix.
Pipeline Modes
Tavus offers a number of modes that come with preconfigured layers as necessary for your use case. You can configure the pipeline mode in the Create Persona API.
Full
Full
Default and recommended option to optimize your multimodal interactions or enable Perception.
You have the option to bring your own ASR / LLM / TTS.
Echo
Echo
You can bypass Tavus Perception, ASR, turn-taking, and LLM and directly stream:
- Text into the TTS layer (text echo), or…
- Audio stream that the replica will repeat (audio echo). Audio stream can be a direct user mic input or base64.
You can also use this mode server-to-server, where your server connects to the Daily/webRTC room to provide audio and then forwards the video stream to your user.
Full Pipeline Mode (default and recommended)
By default, we recommend using the end-to-end pipeline in it’s entirety as it will provide the lowest latency and most optimized multimodal experience. We offer a number of LLMs (Llama3.3, OpenAI) that we’ve optimized within the end-to-end pipeline. With SLAs as fast as under 1s ---- you can access the world’s fastest utterance-to-utterance latency. You can load our LLMs full of your knowledge base and prompt them to your liking, as well as update the context live to simulate an async RAG application.
Custom LLM / Bring your own logic
Using a custom LLM is a great idea for those that already have a LLM or are building business logic that needs to intercept the input transcription and decide on the output. Using your own LLM will likely add latency, as the Tavus LLMs are hyper-optimized for low latency.
Note that the ‘Custom LLM’ mode doesn’t require an actual LLM. Any endpoint that will respond to chat completion requests in the required format can be used. For example, you could set up a server that takes in the completion requests and responds with predetermined responses, with no LLM involved at all.
Learn about how to use Custom LLM mode
Echo Mode
You can specify audio or text input for the replica to speak out. We only recommend this if your application does not have a need for speech recognition (voice) or perception capabilities, or have a very specific ASR/Perception pipeline that you must use. Using your own ASR is most often slower and less optimized than using the integrated Tavus pipeline.
You can use text or audio input interchangeably in Echo Mode. There are two possible configurations, based on microphone enablement in Transport layer.
Learn about how to use Echo Mode
Text or Audio (Base64) Echo
By turning off the microphone in the Transport Layer and using the Interactions Protocol, you can achieve Text and Audio (base64) echo behavior.
-
The Text Echo behavior allows you to bypass Tavus Perception, ASR, turn-taking, and LLM and directly send text into the TTS layer. This allows you to have a replica that speaks all the text you provide, as well as allows you to manually control interrupts.
-
The Audio (Base64) Echo behavior allows you to bypass all Layers except for the Realtime Replica Layer. In this configuration, the replica will speak the audio that you provide.
In order to send text or base64 encoded audio, you should use the Interactions Protocol.
Microphone Echo
By keeping the microphone on in the Transport Layer, you are able to bypass all layers in CVI and directly pass in an audio stream that the replica will repeat. In this mode interrupts are handled within your audio stream, any received audio will be generated with the replica.
We only recommend this if you have pre-generated audio you would like to use, have a voice-to-voice pipeline, or have a very specific voice requirement.