Avaluma AI is composed of two independently deployed components — the Avatar Server and the LiveKit Agent — that communicate through a shared LiveKit room. The Agent handles the conversational intelligence, while the Avatar Server handles GPU rendering. You can run both yourself or use Avaluma’s managed Avatar Server at https://api.avaluma.ai.

The Avatar Server

The Avatar Server is the rendering engine of Avaluma AI. It runs inside a Docker container with direct access to an NVIDIA GPU and performs the following work:
  • Loads your .hvia avatar files from the assets/avatars/ directory at startup
  • Receives audio from the LiveKit Agent via the avaluma-livekit-plugin
  • Renders the photorealistic avatar frame-by-frame, animating lip movement and facial expressions in sync with the audio
  • Publishes the resulting video track back into the LiveKit room, where any connected participant can subscribe to it
Resource requirements: Each simultaneous avatar session consumes approximately 2.5 GB of VRAM. A GPU with 6 GB VRAM can serve two concurrent sessions; scale up your GPU to support more. The server has been tested on Ampere, Ada Lovelace, and Blackwell architectures with CUDA 12.
Hosting options:
Option            URL
Self-hosted       http://localhost:8080 (or your domain with the reverse proxy)
Avaluma Managed   https://api.avaluma.ai
The optional Caddy reverse proxy included in avatar-server/reverse_proxy/ automatically provisions and renews a TLS certificate for your domain, making the self-hosted option production-ready without extra configuration.
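As a rough sizing aid, the VRAM figure from the resource requirements above can be turned into a capacity estimate. The sketch below is only arithmetic on the approximate 2.5 GB per-session number; the function name is hypothetical, and it assumes the whole GPU is available to avatar sessions, so treat the result as an upper bound rather than a guarantee.

```python
import math

# Approximate per-session VRAM quoted in the resource requirements.
VRAM_PER_SESSION_GB = 2.5


def max_concurrent_sessions(gpu_vram_gb: float,
                            per_session_gb: float = VRAM_PER_SESSION_GB) -> int:
    """Estimate an upper bound on simultaneous avatar sessions for one GPU.

    Assumption: all VRAM is available to sessions. Real headroom for the
    driver, CUDA context, and loaded avatar files is not accounted for.
    """
    if gpu_vram_gb < 0 or per_session_gb <= 0:
        raise ValueError("VRAM sizes must be positive")
    return math.floor(gpu_vram_gb / per_session_gb)


print(max_concurrent_sessions(6))   # 2, matching the 6 GB example above
print(max_concurrent_sessions(24))  # 9, an extrapolation, not a tested figure
```

Only the 6 GB case is confirmed by the text; larger values are extrapolations that should be validated under real load.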

The LiveKit Agent

The LiveKit Agent runs a full voice AI pipeline inside a Docker container. It listens to microphone audio from participants in the LiveKit room and drives a conversation through the following stages:
Microphone → STT → LLM → TTS → Avaluma Avatar → Video stream
             │                         │
       AssemblyAI              Avatar Server
       (universal-           (animates .hvia
        streaming)              avatar file)
               LLM: OpenAI GPT-4.1-mini
               TTS: Cartesia Sonic-3
Each stage is pluggable via the AgentSession configuration in agent-1.py:
  • STT — AssemblyAI universal-streaming model transcribes the participant’s speech in real time
  • LLM — OpenAI gpt-4.1-mini generates a response to the transcript
  • TTS — Cartesia sonic-3 synthesises the response as speech audio
  • AvatarSession — the avaluma-livekit-plugin forwards the TTS audio to the Avatar Server, which animates the avatar and streams the video back into the room
The agent also applies Silero VAD for voice-activity detection, LiveKit BVC for background noise cancellation, and a multilingual end-of-turn detection model to know when the participant has finished speaking.
You can swap any stage for a different provider. Refer to the LiveKit Agents plugin directory for compatible STT, LLM, and TTS plugins.
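The idea of swapping a single stage can be pictured with a small, self-contained model. This is not the contents of agent-1.py and it does not call the LiveKit Agents API: PipelineConfig and its string labels are hypothetical stand-ins that only illustrate how one stage changes while the rest of the pipeline stays the same.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical stand-in for the stage choices given to AgentSession.

    The default labels mirror the providers named above; they are
    illustrative strings, not real plugin identifiers.
    """
    stt: str = "AssemblyAI universal-streaming"
    llm: str = "OpenAI gpt-4.1-mini"
    tts: str = "Cartesia sonic-3"

    def stages(self) -> list[str]:
        # Order follows the pipeline diagram: audio in, video out.
        return ["Microphone", self.stt, self.llm, self.tts,
                "Avaluma Avatar", "Video stream"]


default = PipelineConfig()
# Swap only the LLM stage; "another-llm" is a placeholder, not a real model.
custom = replace(default, llm="another-llm")

print(" -> ".join(default.stages()))
print(" -> ".join(custom.stages()))
```

In the real agent, each of these slots would hold a plugin instance from the LiveKit Agents plugin directory rather than a string.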

External Audio (agent-2 Pattern)

The Avatar Server is not limited to the AgentSession pipeline. Any external service that holds a valid LiveKit token can stream audio directly to the avatar over a LiveKit DataStream on the topic lk.audio_stream. The avatar animates that audio without an Agent or AgentSession involved at all.
WAV file → DataStream → Avaluma Avatar → Video stream
(external sender,               │
 own LiveKit token)      Avatar Server
This pattern is demonstrated in agent-2 and is useful when you already have a speech synthesis service, a pre-recorded script, or any audio source outside the standard pipeline.
The external sender only needs a LiveKit token and the DataStream topic lk.audio_stream — no Avaluma SDK is required on the sender side.
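On the sender side, the first practical step is usually reading the WAV file and cutting it into small pieces that can be written to the stream one after another. The sketch below covers only that step, using the Python standard library. The 20 ms chunk size and both function names are assumptions for illustration; the exact audio format the Avatar Server expects on lk.audio_stream is not specified here, and the LiveKit connection itself is left out.

```python
import io
import wave
from typing import Iterator


def iter_pcm_chunks(wav_file, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield raw PCM chunks of about chunk_ms milliseconds from a WAV file.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        frames_per_chunk = max(1, wav.getframerate() * chunk_ms // 1000)
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk


def make_silent_wav(duration_ms: int = 100, sample_rate: int = 16000) -> io.BytesIO:
    """Build a mono 16-bit WAV of silence in memory (demo input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * (sample_rate * duration_ms // 1000))
    buf.seek(0)
    return buf


chunks = list(iter_pcm_chunks(make_silent_wav()))
print(len(chunks), len(chunks[0]))  # 5 640
```

A real sender would join the room with its own LiveKit token and write each chunk to a stream opened on the lk.audio_stream topic instead of collecting the chunks in a list.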

Component Summary

Component       Responsibility                                  Hosting
Avatar Server   GPU rendering, lip-sync, video streaming        Self-hosted or api.avaluma.ai
LiveKit Agent   STT → LLM → TTS voice AI pipeline               Self-hosted Docker container
LiveKit Room    Shared media layer connecting both components   LiveKit Cloud or self-hosted

Explore Further

Avatar Server

Learn how to configure GPU resources, manage avatar files, and enable HTTPS.

LiveKit Agent

Dive into the voice AI pipeline, environment variables, and the external audio pattern.