Voice nodes control how your agent speaks during a call. They determine the Text-to-Speech (TTS) model, language, voice style, and advanced behavior used to generate audio responses. Different providers offer different strengths. All voice nodes plug into the Speak connector on the Start node or Agent nodes. The selected voice is used whenever the agent responds aloud.
If multiple voice nodes exist in a flow, the agent uses whichever node is directly connected to the Speak port at that point in the path.
Each voice provider supports different configuration options, such as choosing dialects, tuning stylistic parameters, and adjusting speed or other advanced timing settings. The following sections describe these nodes individually and outline how to customize their behavior for your use case.

Breez Voice

Breez Voice provides studio-quality Arabic and English text-to-speech, with support for multiple localized Arabic dialects and several voice options per dialect.

Main Settings

Model

Select the Breez studio TTS model. The only option available at the moment is:
  • Breez Sirocco (Studio) – high-fidelity, natural-sounding speech tuned for real-time spoken interactions.

Dialect / Language

Choose the language or dialect you want the agent to speak. Options include:
  • English
  • Arabic, with 13 dialect choices (e.g., Lebanese, Bahraini, Egyptian, Saudi)
Selecting a dialect automatically filters the set of available voices.

Voice

Choose the specific voice for the selected dialect or language. Each dialect provides multiple voice options (e.g., male/female, different tonal profiles).
Voice availability varies depending on the chosen dialect.
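
The dependency between the Dialect / Language field and the Voice field can be pictured as a simple lookup. The sketch below is illustrative only: the catalog contents, the voice identifiers, and the function names are hypothetical and are not taken from the Breez platform.

```python
# Hypothetical catalog: the real dialect and voice lists come from the editor.
VOICES_BY_DIALECT = {
    "English": ["en_voice_a", "en_voice_b"],
    "Arabic (Lebanese)": ["lb_voice_a", "lb_voice_b"],
    "Arabic (Egyptian)": ["eg_voice_a"],
}


def voices_for(dialect: str) -> list[str]:
    """Return the voices offered for a dialect (empty list if unknown)."""
    return list(VOICES_BY_DIALECT.get(dialect, []))


def is_valid_selection(dialect: str, voice: str) -> bool:
    """A voice is only valid together with a dialect that offers it."""
    return voice in voices_for(dialect)
```

Under this model, changing the dialect invalidates a previously chosen voice unless the new dialect also offers it, which is why the Voice field is re-filtered after each dialect change.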

Advanced Timing

Enable this option to reveal fine-grained timing controls for shaping how quickly or smoothly audio is delivered. These settings are useful when tuning responsiveness during turn-taking. When enabled, three controls appear:

Idle Flush (Silent)

Timeout (in seconds) for flushing sentences when the TTS is not currently speaking.
  • Lower values → faster response, more interruptible
  • Higher values → waits longer to gather fuller sentences before output

Idle Flush (Speaking)

Timeout for flushing sentences while the TTS is speaking.
Higher values allow Breez Voice to complete more natural-sounding sentence fragments before sending audio.
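
One plausible reading of the two idle-flush timeouts is a text buffer that is emitted once no new text has arrived for a given interval, with the interval chosen by whether audio is currently playing. The sketch below follows that reading; it is an assumption about the behavior, not the Breez implementation, and the class name, method names, and example timeout values are hypothetical.

```python
from typing import Optional


class FlushBuffer:
    """Buffers incoming text and releases it after an idle timeout."""

    def __init__(self, idle_flush_silent: float, idle_flush_speaking: float):
        self.idle_flush_silent = idle_flush_silent      # seconds, TTS not speaking
        self.idle_flush_speaking = idle_flush_speaking  # seconds, TTS speaking
        self._text = ""
        self._last_update = 0.0

    def push(self, chunk: str, now: float) -> None:
        """Append text and restart the idle timer."""
        self._text += chunk
        self._last_update = now

    def poll(self, now: float, speaking: bool) -> Optional[str]:
        """Return the buffered text if the applicable timeout has elapsed."""
        timeout = self.idle_flush_speaking if speaking else self.idle_flush_silent
        if self._text and now - self._last_update >= timeout:
            flushed, self._text = self._text, ""
            return flushed
        return None


# Example values only; the node's actual defaults are not stated here.
buf = FlushBuffer(idle_flush_silent=0.2, idle_flush_speaking=0.6)
buf.push("Hello, ", now=0.0)
early = buf.poll(now=0.1, speaking=False)  # None: still inside the silent timeout
```

With a longer timeout while speaking, more text accumulates before each flush, which matches the description of fuller sentence fragments being sent to synthesis.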

Trailing Silence

Duration (in seconds) of silence appended at the end of generated audio.
Used to prevent clipping or audio clicks at the sentence boundary.
  • 0.00 → no extra silence
  • 0.15 → minimal smoothing (recommended)
  • ~0.50 → noticeably longer pause
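
To make the Trailing Silence values concrete, the sketch below pads raw audio with zero-valued samples. It assumes mono, signed 16-bit PCM at 16 kHz; the audio format Breez Voice actually produces is not stated here, and the function name is hypothetical.

```python
def append_trailing_silence(pcm: bytes, seconds: float,
                            sample_rate: int = 16000,
                            sample_width: int = 2) -> bytes:
    """Append `seconds` of silence to mono signed PCM audio.

    sample_rate and sample_width (bytes per sample) are assumed defaults.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    n_samples = round(seconds * sample_rate)
    # Zero bytes decode to zero amplitude in signed PCM, i.e. silence.
    return pcm + b"\x00" * (n_samples * sample_width)
```

Under these assumptions, the recommended 0.15 s adds 2,400 samples (4,800 bytes), while 0.50 s adds 8,000 samples (16,000 bytes).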

Deepgram Voice

Deepgram Voice provides fast, high-quality text-to-speech with a wide range of English-only voices across two model families: Aura 1 and Aura 2. These models focus on clarity, natural pacing, and minimal latency, which makes them well suited to real-time conversational agents.

Main Settings

Model

Select which Deepgram TTS model family to use. Options:
  • Aura 1 – earlier generation voice set
  • Aura 2 – newer, more expressive and natural-sounding voices
The model you choose determines which voice options will appear in the next field.

Voice

Choose the specific voice used for speech output.
Each Aura model includes several English voices (e.g., male/female, different tonal profiles). Voice availability depends on the selected model (Aura 1 or Aura 2).

Opt in to Deepgram MIP (50% discount)

Enable Deepgram’s Model Improvement Program to receive ~50% lower pricing. Your audio may be used by Deepgram to improve future models (per their policy).
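
The effect of the opt-in on a usage estimate is simple arithmetic. The sketch below assumes per-character billing and takes the price as an input; both the billing unit and any price you pass in are placeholders, and the 50% figure is approximate, so check current pricing before relying on the result.

```python
def estimated_tts_cost(characters: int, price_per_1k_chars: float,
                       mip_opt_in: bool, mip_discount: float = 0.5) -> float:
    """Estimate TTS cost, applying the approximate MIP discount if opted in.

    price_per_1k_chars is a placeholder input, not a published rate.
    """
    base = characters / 1000 * price_per_1k_chars
    return base * (1 - mip_discount) if mip_opt_in else base
```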

ElevenLabs Voice

ElevenLabs provides high-quality multilingual text-to-speech with expressive emotional control. Use this node when you want natural, humanlike speech with fine-tunable delivery characteristics.

Main Settings

Language

Select the language the model should speak in.
Languages include English, Arabic, Hindi, Spanish, French, Italian, and Indonesian.

Gender

Choose the desired voice gender.
Some languages only support one gender; if an unsupported option is selected, the system automatically falls back to the closest available voice.
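
The fallback can be sketched as a two-step lookup: try the requested gender first, then any gender the language does offer. The catalog below is invented for illustration (including which language lacks a gender), and "first available voice" stands in for whatever "closest" means in the actual system.

```python
from typing import Optional

# Hypothetical catalog keyed by language, then gender.
CATALOG = {
    "English": {"female": ["en_f_1"], "male": ["en_m_1"]},
    "Indonesian": {"female": ["id_f_1"]},  # pretend only one gender exists
}


def resolve_voice(language: str, gender: str) -> Optional[str]:
    """Return a voice for the request, falling back across genders."""
    by_gender = CATALOG.get(language, {})
    if by_gender.get(gender):
        return by_gender[gender][0]
    # Fallback: take the first voice of any gender that is available.
    for voices in by_gender.values():
        if voices:
            return voices[0]
    return None
```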

Voice

Pick a voice compatible with the selected language and gender.
Voice availability varies per language.

Model

Choose which ElevenLabs TTS model to use:
  • Flash V2.5 (Fastest) — best for real-time interactions
  • Turbo V2.5 — higher audio quality with slightly more latency

Advanced Options

Enabling Advanced Options reveals detailed settings for controlling expressiveness and delivery:

Stability

Controls how expressive or consistent the voice is.
  • Lower values → more emotional, dynamic delivery
  • Higher values → more consistent, but may sound monotone

Similarity Boost

Adjusts how closely the AI matches the original speaker’s voice.
  • Higher values boost clarity and consistency
  • Extremely high values may introduce distortion

Style

Amplifies the expressive style of the voice.
  • Higher values exaggerate style but can increase latency

Speed

Controls the speaking rate (range: 0.8–1.2).
Values between 0.9 and 1.1 sound most natural.

Use Speaker Boost

Enhances similarity to the original speaker.
This may add slight latency.
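
The advanced options can be checked before they are applied. The sketch below enforces the 0.8–1.2 speed range stated above and flags the settings described as affecting naturalness or latency. The 0–1 scale for Stability, Similarity Boost, and Style is an assumption, and the class and field names are hypothetical rather than the node's actual schema.

```python
from dataclasses import dataclass

SPEED_RANGE = (0.8, 1.2)    # range stated for this node
NATURAL_SPEED = (0.9, 1.1)  # band described as sounding most natural


@dataclass
class AdvancedOptions:
    stability: float
    similarity_boost: float
    style: float
    speed: float
    use_speaker_boost: bool

    def validate(self) -> list[str]:
        """Raise on out-of-range values; return advisory warnings otherwise."""
        for name in ("stability", "similarity_boost", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:  # assumed 0-1 scale
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        if not SPEED_RANGE[0] <= self.speed <= SPEED_RANGE[1]:
            raise ValueError(f"speed must be between 0.8 and 1.2, got {self.speed}")
        warnings = []
        if not NATURAL_SPEED[0] <= self.speed <= NATURAL_SPEED[1]:
            warnings.append("speed is outside the 0.9-1.1 band that sounds most natural")
        if self.use_speaker_boost:
            warnings.append("speaker boost may add slight latency")
        return warnings
```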

OpenAI Voice

OpenAI Voice uses OpenAI’s built-in Text-to-Speech service to generate high-quality, natural-sounding audio responses.

Main Settings

Model

Select which OpenAI TTS model to use. Each model offers a different balance of latency and audio fidelity.
  • GPT-4o Mini TTS (Latest) — Latest-generation model with strong quality and performance across most use cases.
  • TTS-1 (Optimized for Speed) — Very low latency (~75 ms). Ideal for real-time conversational agents.
  • TTS-1-HD (Optimized for Quality) — High-fidelity audio with reduced static. Best for audiobooks, podcasts, and professional narration. Higher latency.

Voice

Choose from OpenAI’s available preset voices.
A variety of male and female voices are provided, each with its own tone and style.
The list shown in the editor includes full descriptions to help match the right voice to your experience.

Advanced Options

Speed

Controls the playback speed of the generated audio.
  • Range: 0.25× to 4.0×
  • Default: 1.0×
  • Notes:
    • 0.9–1.1 typically sounds the most natural
    • Lower speeds create slower, more deliberate speech
    • Higher speeds produce faster, more energetic delivery
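
The settings above map naturally onto a speech request. The sketch below builds such a request as a plain dictionary without sending it. The model identifiers and field names follow OpenAI's public speech API, but whether this node sends exactly this payload is an assumption, and the function name and label mapping are hypothetical.

```python
# Editor labels mapped to OpenAI model identifiers (mapping is assumed).
MODEL_IDS = {
    "GPT-4o Mini TTS": "gpt-4o-mini-tts",
    "TTS-1": "tts-1",
    "TTS-1-HD": "tts-1-hd",
}
MIN_SPEED, MAX_SPEED, DEFAULT_SPEED = 0.25, 4.0, 1.0


def build_speech_request(model_label: str, voice: str, text: str,
                         speed: float = DEFAULT_SPEED) -> dict:
    """Validate the node settings and return a request payload."""
    if model_label not in MODEL_IDS:
        raise ValueError(f"unknown model: {model_label}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between 0.25 and 4.0, got {speed}")
    return {
        "model": MODEL_IDS[model_label],
        "voice": voice,
        "input": text,
        "speed": speed,
    }


# "alloy" is one of OpenAI's preset voices; the text is an arbitrary example.
request = build_speech_request("TTS-1", "alloy", "Thanks for calling.")
```

Rejecting out-of-range speeds up front, rather than clamping them silently, keeps a misconfigured node from producing unexpectedly fast or slow speech.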