If multiple voice nodes exist in a flow, the agent uses whichever node is directly connected to the Speak port at that point in the path. Each voice provider supports different configuration options, such as choosing dialects, tuning stylistic parameters, and adjusting speed or other advanced timing settings. The following sections describe these nodes individually and outline how to customize their behavior for your use case.
Breez Voice
Breez Voice provides studio-quality Arabic and English text-to-speech, with support for multiple localized Arabic dialects and several voice options per dialect.
Main Settings
Model
Select the Breez studio TTS model. The only option currently available is:
- Breez Sirocco (Studio) – high-fidelity, natural-sounding speech tuned for real-time spoken interactions.
Dialect / Language
Choose the language or dialect you want the agent to speak. Options include:
- English
- Arabic, with 13 dialect choices (e.g., Lebanese, Bahraini, Egyptian, Saudi)
Voice
Choose the specific voice for the selected dialect or language. Each dialect provides multiple voice options (e.g., male/female, different tonal profiles). Voice availability varies depending on the chosen dialect.
Advanced Timing
Enable this option to reveal fine-grained timing controls for shaping how quickly or smoothly audio is delivered. These settings are useful when tuning responsiveness during turn-taking. When enabled, three controls appear:
Idle Flush (Silent)
Timeout (in seconds) for flushing sentences when the TTS is not currently speaking.
- Lower values → faster response, more interruptible
- Higher values → waits longer to gather fuller sentences before output
Idle Flush (Speaking)
Timeout for flushing sentences while the TTS is speaking. Higher values allow Breez Voice to complete more natural-sounding sentence fragments before sending audio.
Trailing Silence
Duration (in seconds) of silence appended at the end of generated audio. This prevents clipping or audio clicks at sentence boundaries.
- 0.00 → no extra silence
- 0.15 → minimal smoothing (recommended)
- ~0.50 → noticeably longer pause
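The three timing controls above can be pictured with a small sketch. This is not Breez Voice's actual implementation: the `TimingSettings` class, both function names, and the 16 kHz sample rate are assumptions made for illustration. It only shows how a flush timeout that depends on the speaking state, plus a trailing-silence duration, might be applied.

```python
from dataclasses import dataclass


@dataclass
class TimingSettings:
    # Hypothetical container for the three Advanced Timing controls.
    idle_flush_silent: float        # seconds to wait before flushing while not speaking
    idle_flush_speaking: float      # seconds to wait before flushing while speaking
    trailing_silence: float = 0.15  # seconds of silence appended after audio


def should_flush(idle_seconds: float, is_speaking: bool, settings: TimingSettings) -> bool:
    """Return True once buffered text has been idle for the applicable timeout."""
    if is_speaking:
        timeout = settings.idle_flush_speaking
    else:
        timeout = settings.idle_flush_silent
    return idle_seconds >= timeout


def trailing_silence_samples(settings: TimingSettings, sample_rate_hz: int = 16000) -> int:
    """Number of silent samples to append (sample rate is an assumed value)."""
    return round(settings.trailing_silence * sample_rate_hz)


# Example values are arbitrary, chosen only to show the two timeouts diverging.
settings = TimingSettings(idle_flush_silent=0.2, idle_flush_speaking=0.6)
print(should_flush(0.3, is_speaking=False, settings=settings))
print(should_flush(0.3, is_speaking=True, settings=settings))
print(trailing_silence_samples(settings))
```

With a shorter silent timeout, buffered text is released sooner when nothing is playing, while the longer speaking timeout lets more of a sentence accumulate before it is sent.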
Deepgram Voice
Deepgram Voice provides fast, high-quality text-to-speech with a wide range of English-only voices across two model families: Aura 1 and Aura 2. These models focus on clarity, natural pacing, and minimal latency, which makes them well suited to real-time conversational agents.
Main Settings
Model
Select which Deepgram TTS model family to use. Options:
- Aura 1 – earlier generation voice set
- Aura 2 – newer, more expressive and natural-sounding voices
Voice
Choose the specific voice used for speech output. Each Aura model includes several English voices (e.g., male/female, different tonal profiles). Voice availability depends on the selected model (Aura 1 or Aura 2).
Opt in to Deepgram MIP (50% discount)
Enable Deepgram’s Model Improvement Program to receive ~50% lower pricing. Your audio may be used by Deepgram to improve future models (per their policy).
ElevenLabs Voice
ElevenLabs provides high-quality multilingual text-to-speech with expressive emotional control. Use this node when you want natural, humanlike speech with fine-tunable delivery characteristics.
Main Settings
Language
Select the language the model should speak in. Languages include English, Arabic, Hindi, Spanish, French, Italian, and Indonesian.
Gender
Choose the desired voice gender. Some languages only support one gender; if an unsupported option is selected, the system automatically falls back to the closest available voice.
Voice
Pick a voice compatible with the selected language and gender. Voice availability varies per language.
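The fallback behavior described under Gender and Voice can be sketched as follows. The catalog contents, the voice names, and the `pick_voice` function are all hypothetical placeholders, and "closest available voice" is interpreted here simply as "the first voice offered for that language"; the real selection rule is not documented above.

```python
from typing import Optional

# Hypothetical catalog: language -> gender -> voice names (placeholders, not real voices).
CATALOG = {
    "English": {"female": ["voice-en-f1", "voice-en-f2"], "male": ["voice-en-m1"]},
    # Assumed for illustration only: a language that offers a single gender.
    "Indonesian": {"female": ["voice-id-f1"]},
}


def pick_voice(language: str, gender: str, preferred: Optional[str] = None) -> str:
    """Pick a voice for the language, falling back when the gender is unsupported."""
    by_gender = CATALOG[language]
    voices = by_gender.get(gender)
    if not voices:
        # Requested gender is not offered: fall back to whatever the language provides.
        voices = next(iter(by_gender.values()))
    # Honor the preferred voice only if it is valid for the resolved list.
    if preferred in voices:
        return preferred
    return voices[0]


print(pick_voice("English", "male"))
print(pick_voice("Indonesian", "male"))
```

The second call requests a gender the placeholder catalog does not offer for that language, so the function returns the available voice instead of failing.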
Model
Choose which ElevenLabs TTS model to use:
- Flash V2.5 (Fastest) – best for real-time interactions
- Turbo V2.5 – higher audio quality with slightly more latency
Advanced Options
Enabling Advanced Options reveals detailed settings for controlling expressiveness and delivery:
Stability
Controls how expressive or consistent the voice is.
- Lower values → more emotional, dynamic delivery
- Higher values → more consistent, but may sound monotone
Similarity Boost
Adjusts how closely the AI matches the original speaker’s voice.
- Higher values boost clarity and consistency
- Extremely high values may introduce distortion
Style
Amplifies the expressive style of the voice.
- Higher values exaggerate style but can increase latency
Speed
Controls the speaking rate (range: 0.8–1.2). Values between 0.9 and 1.1 sound most natural.
Use Speaker Boost
Enhances similarity to the original speaker. This may add slight latency.
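As a rough illustration of how these advanced options fit together, the sketch below collects them into one settings dictionary and keeps each value inside a valid range. The 0.8–1.2 speed range comes from the text above; the 0.0–1.0 range for the other sliders, the default values, and the function and key names are assumptions for this example, not a description of how the node stores its settings.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Constrain value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def build_advanced_options(
    stability: float = 0.5,          # assumed default
    similarity_boost: float = 0.75,  # assumed default
    style: float = 0.0,              # assumed default
    speed: float = 1.0,
    use_speaker_boost: bool = False,
) -> dict:
    """Return a settings dict with every value forced into its allowed range."""
    return {
        "stability": clamp(stability, 0.0, 1.0),                # assumed 0.0-1.0 range
        "similarity_boost": clamp(similarity_boost, 0.0, 1.0),  # assumed 0.0-1.0 range
        "style": clamp(style, 0.0, 1.0),                        # assumed 0.0-1.0 range
        "speed": clamp(speed, 0.8, 1.2),                        # range stated in the docs
        "use_speaker_boost": bool(use_speaker_boost),
    }


# A speed request outside the documented range is pulled back to the nearest bound.
print(build_advanced_options(stability=0.3, speed=1.5))
```

Clamping is one design choice among several; rejecting out-of-range values with an error would be equally reasonable if silent correction is undesirable.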
OpenAI Voice
OpenAI Voice uses OpenAI’s built-in text-to-speech service to generate high-quality, natural-sounding audio responses.
Main Settings
Model
Select which OpenAI TTS model to use. Each model offers a different balance of speed, latency, and audio fidelity.
- GPT-4o Mini TTS (Latest) – Latest-generation model with strong quality and performance across most use cases.
- TTS-1 (Optimized for Speed) – Very low latency (~75 ms). Ideal for real-time conversational agents.
- TTS-1-HD (Optimized for Quality) – High-fidelity audio with reduced static. Best for audiobooks, podcasts, and professional narration. Higher latency.
Voice
Choose from OpenAI’s available preset voices. A variety of male and female voices are provided, each with its own tone and style.
The list shown in the editor includes full descriptions to help match the right voice to your experience.
Advanced Options
Speed
Controls the playback speed of the generated audio.
- Range: 0.25× to 4.0×
- Default: 1.0×
- Notes:
  - 0.9–1.1 typically sounds the most natural
  - Lower speeds create slower, more deliberate speech
  - Higher speeds produce faster, more energetic delivery
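The model choice and speed range above can be summarized in a short sketch that maps the editor labels to model identifiers and checks the speed before building a request. The `speech_params` helper and the label-to-id mapping are assumptions made for this example; the node's internal request format is not documented here. The identifiers and the `alloy` voice follow OpenAI's public text-to-speech API naming.

```python
# Assumed mapping from the labels shown in the editor to OpenAI model identifiers.
MODEL_IDS = {
    "GPT-4o Mini TTS": "gpt-4o-mini-tts",
    "TTS-1": "tts-1",
    "TTS-1-HD": "tts-1-hd",
}

MIN_SPEED = 0.25
MAX_SPEED = 4.0


def speech_params(text: str, model_label: str = "TTS-1",
                  voice: str = "alloy", speed: float = 1.0) -> dict:
    """Build request parameters, rejecting speeds outside the documented range."""
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {speed}")
    return {
        "model": MODEL_IDS[model_label],
        "voice": voice,
        "input": text,
        "speed": speed,
    }


print(speech_params("Thanks for calling. How can I help?", speed=1.05))
```

A dictionary like this could be passed as keyword arguments to a text-to-speech client call; the sketch stops short of making a network request so it can run without credentials.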

