Every agent or workflow can inherit the default transcriber from the Start node, or you can override it by attaching a transcriber node directly to an Agent or the Start node. Different providers offer different strengths — multilingual support, speed, accuracy, or domain specialization. The settings you configure here directly impact how quickly and accurately your agent understands the caller.
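For instance, a workflow might set its default transcriber on the Start node and override it for a single agent. The sketch below is a hypothetical illustration of that inheritance; the node shapes and field names are assumptions for illustration, not the platform's actual schema.

```typescript
// Hypothetical workflow shape: the Start node's transcriber is the workflow
// default, and an Agent node can attach its own transcriber to override it.
interface TranscriberConfig {
  provider: "breez" | "deepgram" | "elevenlabs" | "openai";
  model: string;
}

interface WorkflowNode {
  id: string;
  type: "start" | "agent";
  transcriber?: TranscriberConfig; // optional on agents; falls back to the Start node
}

const workflow: WorkflowNode[] = [
  { id: "start", type: "start", transcriber: { provider: "breez", model: "echo-universal" } },
  { id: "sales-agent", type: "agent" }, // inherits the Start node's transcriber
  { id: "triage-agent", type: "agent", transcriber: { provider: "deepgram", model: "flux-general" } },
];

// Resolve the effective transcriber for a node, falling back to the default.
function effectiveTranscriber(node: WorkflowNode): TranscriberConfig {
  const start = workflow.find((n) => n.type === "start")!;
  return node.transcriber ?? start.transcriber!;
}

console.log(effectiveTranscriber(workflow[1]!)); // -> the Breez default from the Start node
```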
Breez Ears
Breez Ears is our unified first-party speech-to-text engine, offering Echo (multilingual) and Wave (single-language, enhanced accuracy) models. This node allows you to control model selection, language configuration, contextual boosts (Echo only), and endpoint/utterance detection behavior.
Main Settings
Model
Choose which Breez transcriber to use. Options:
- Echo Universal – 60+ Languages: Multilingual model with automatic language detection.
- Wave Pro – Enhanced Accuracy: Single-language, highest accuracy with deeper processing.
- Wave Lite – Fast & Efficient: Single-language, optimized for maximum speed and low latency.
This determines the underlying STT model powering all recognition for this agent or workflow.
Echo is ideal for multilingual or unknown-language calls; Wave models are ideal for predictable languages and maximum precision.
Language Hints (Echo only)
Appears only when Echo is selected. Suggest one or more expected languages to improve recognition; hints help the model narrow down accent, vocabulary, and acoustic patterns. Leave empty to enable full automatic detection across all 60+ supported languages. Options: a selectable list of 30+ languages (e.g., English, Spanish, Hindi, Arabic). When to use hints:
- Use hints when you know the likely caller language(s).
- Leave empty for international or uncertain-language workflows.
Language (Wave only)
Appears only when Wave Pro or Wave Lite is selected. Since Wave models support only one language at a time, this field specifies the exact language you expect from the caller. Typical use cases:
- You’re operating in a single-language region.
- The workflow is designed for a predictable language (e.g., English support line).
- You want to maximize Wave’s accuracy and speed by removing ambiguity.
Context (Optional) (Echo only)
Appears only when Echo is selected. Provide additional domain-specific context that improves transcription of names, jargon, product terminology, etc. Examples:
- “medical terms, product names, technical jargon”
- “car models, insurance terms, claim numbers”
The field also accepts @variableName references, allowing the model to dynamically adapt based on runtime data.
Server EOU Mode (Wave only)
Appears only when Wave Pro or Wave Lite is selected. Controls the end-of-utterance (EOU) strategy used to determine when a transcript should finalize and the agent should respond. In practice:
- Faster EOU → quicker agent responses but may finalize too early.
- Slower EOU → more complete transcription but slightly higher latency.
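Putting the Wave settings together, a node configuration might look like the following sketch. The field names and EOU mode values are illustrative assumptions, not the exact schema.

```typescript
// Hypothetical Breez Wave transcriber config; names are illustrative.
interface BreezWaveConfig {
  model: "wave-pro" | "wave-lite";
  language: string;                               // single language, e.g. "en-US"
  serverEouMode: "fast" | "balanced" | "accurate"; // assumed value names
}

// English support line: predictable language, so Wave Pro maximizes accuracy.
const supportLine: BreezWaveConfig = {
  model: "wave-pro",
  language: "en-US",
  serverEouMode: "balanced", // faster modes respond sooner but may cut callers off
};
```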
Endpoint Detection Mode (Echo only)
Controls how Breez decides when the caller has finished speaking. Options:
- Advanced: Uses enhanced Breez endpoint detection for more accurate turn-taking. Best for natural conversations where interruptions, overlaps, and short pauses are common.
- Standard: Uses Breez’s semantic/VAD fallback system, providing broader language steering and compatibility.
This choice affects:
- How quickly the agent begins responding
- Whether brief pauses are interpreted as the user “being done”
- Overall smoothness of barge-in and turn detection
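Taken together, the Echo settings above might combine into a configuration like this hypothetical sketch (field and value names are assumptions, and @policyType stands in for any runtime variable):

```typescript
// Hypothetical Breez Echo transcriber config; names are illustrative.
interface BreezEchoConfig {
  model: "echo-universal";
  languageHints: string[];   // empty = full auto-detection across 60+ languages
  context?: string;          // domain terms; may reference @variableName values
  endpointDetectionMode: "advanced" | "standard";
}

// Insurance line expecting English or Spanish callers.
const claimsIntake: BreezEchoConfig = {
  model: "echo-universal",
  languageHints: ["en", "es"],
  context: "car models, insurance terms, claim numbers, @policyType", // @policyType is a hypothetical runtime variable
  endpointDetectionMode: "advanced", // smoother turn-taking with short pauses and overlaps
};
```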
Deepgram Ears
Deepgram Ears integrates Deepgram’s real-time Speech-to-Text engine into your workflow. Depending on the selected model, different settings become available — Flux models expose turn-taking and latency controls, while Nova models expose formatting and transcription options.
Main Settings
Model
Select which Deepgram model to use. Options:
- Flux General (English only): Real-time model optimized for speed and responsive turn-taking.
- Nova 3 General: High-accuracy multilingual model.
- Nova 3 Medical (English only): Medical-tuned model for clinical terminology.
Opt in to Deepgram MIP (50% discount)
Enable Deepgram’s Model Improvement Program to receive ~50% lower pricing. Your audio may be used by Deepgram to improve future models (per their policy).
Flux Model Settings
Shown when Flux General is selected. Flux exposes Deepgram’s real-time turn-taking controls, allowing you to tune how aggressively or conservatively the system decides a user has finished speaking.
End of Turn Threshold
Controls how confident the model must be that the speaker has finished talking (range 0.5–0.9).
- Lower values (~0.5–0.6): Faster responses, but increased risk of cutting the user off.
- Higher values (~0.7–0.9): Safer, more reliable detection, but slightly slower responses.
End of Turn Timeout
Maximum allowed silence before forcing a turn end (1–10 seconds). Flux uses semantic detection first; this timeout only triggers if semantic detection is uncertain. Recommended: ~3 seconds (a good safety net without creating long pauses).
Eager Response Mode
Controls whether the agent begins preparing its response before Deepgram is fully certain the user has finished speaking. Options:
- Off – Most conservative. Waits for high confidence.
- Conservative – Starts preparing slightly earlier, still cautious.
- Balanced – Good mix of responsiveness and safety.
- Aggressive – Fastest responses; may occasionally interrupt if the user pauses mid-sentence.
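As a hypothetical sketch of how the Flux controls combine (field names and values are assumptions, not Deepgram's or this platform's actual schema):

```typescript
// Hypothetical Deepgram Flux settings; names are illustrative.
interface DeepgramFluxConfig {
  model: "flux-general";
  mipOptIn: boolean;               // Model Improvement Program (~50% discount)
  endOfTurnThreshold: number;      // 0.5-0.9; lower = faster but riskier
  endOfTurnTimeoutSeconds: number; // 1-10; safety net when semantics are uncertain
  eagerResponseMode: "off" | "conservative" | "balanced" | "aggressive";
}

const fastReception: DeepgramFluxConfig = {
  model: "flux-general",
  mipOptIn: true,
  endOfTurnThreshold: 0.7,     // reliable detection with modest latency
  endOfTurnTimeoutSeconds: 3,  // the recommended safety net
  eagerResponseMode: "balanced",
};
```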
Nova Model Settings
Shown when Nova 3 General or Nova 3 Medical is selected. Nova models focus on transcription quality and formatting rather than turn-detection controls.
Language
- Nova 3 General: Choose Auto (multilingual) or any of 35+ supported languages.
- Nova 3 Medical: Only English is available.
Punctuation
Automatically inserts punctuation marks into the transcript.
Filler Words
Includes utterances like “uh”, “um”, and similar fillers. Disable if you want cleaner transcripts; enable if fillers are important for intent or analysis.
Smart Format
Automatically formats structured content such as dates, times, addresses, or phone numbers.
Profanity Filter
Masks or filters out offensive language.
Numerals
Converts written-out numbers into numeric digits. Example: “one hundred twenty three” → “123”
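A hypothetical sketch combining the Nova options above (field names are illustrative assumptions):

```typescript
// Hypothetical Deepgram Nova settings; names are illustrative.
interface DeepgramNovaConfig {
  model: "nova-3-general" | "nova-3-medical";
  language: string;        // "auto" or a specific code; Nova 3 Medical is English only
  punctuation: boolean;
  fillerWords: boolean;    // keep "uh"/"um" when they matter for intent analysis
  smartFormat: boolean;    // dates, times, addresses, phone numbers
  profanityFilter: boolean;
  numerals: boolean;       // "one hundred twenty three" -> "123"
}

const clinicLine: DeepgramNovaConfig = {
  model: "nova-3-medical",
  language: "en",
  punctuation: true,
  fillerWords: false,      // cleaner transcripts for clinical notes
  smartFormat: true,
  profanityFilter: false,
  numerals: true,
};
```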
ElevenLabs Ears
ElevenLabs Ears uses ElevenLabs’ Scribe realtime speech-to-text engine, offering fast and accurate multilingual transcription. This node allows you to configure the model, language behavior, and optional audio-event tagging.
Main Settings
Model
Select the ElevenLabs Scribe model. Options:
- Scribe V2 Realtime (Latest) — the latest realtime model with improved accuracy and latency.
Language
Choose a specific language or let ElevenLabs automatically detect it. Options:
- Auto (detect language) — ElevenLabs will automatically identify the spoken language.
- Specific language — e.g., English (US), Spanish, French, German, etc. (full list available in the dropdown).
- Custom (enter code) — manually provide a language code for advanced or unlisted languages.
- Not Selected (legacy) — only for compatibility with older configurations.
Use Auto for global workflows or uncertain language scenarios; choose a specific language when accuracy is the priority; use Custom for languages not explicitly listed.
Custom Language Code
Only visible when Language = Custom. Enter an ISO 639-1 or BCP-47 language code to explicitly control transcription behavior. Examples:
- en — English
- es-MX — Spanish (Mexico)
- pt-BR — Portuguese (Brazil)
You can also use @variableName if the language should be determined dynamically during the workflow.
Tag Audio Events
Whether to include markers for non-speech sounds in the transcript. Examples of audio events:
- (laughter)
- (cough)
- (music)
On: Includes these annotations inline in the transcript.
Off: Produces clean text only, without annotations.
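The ElevenLabs settings might combine as in this hypothetical sketch (field names are assumptions, and @callerLanguage stands in for any runtime variable):

```typescript
// Hypothetical ElevenLabs Scribe settings; names are illustrative.
interface ElevenLabsEarsConfig {
  model: "scribe-v2-realtime";
  language: string;            // "auto", a listed language, or "custom"
  customLanguageCode?: string; // ISO 639-1 / BCP-47; used when language = "custom"
  tagAudioEvents: boolean;     // annotate (laughter), (cough), (music)
}

const globalHotline: ElevenLabsEarsConfig = {
  model: "scribe-v2-realtime",
  language: "custom",
  customLanguageCode: "pt-BR", // could also be "@callerLanguage" (hypothetical runtime variable)
  tagAudioEvents: false,       // clean text only
};
```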
OpenAI Ears
OpenAI Ears provides streaming speech-to-text powered by the GPT-4o family of transcription models. It supports optional prompting, language control, and microphone-optimized noise reduction.
Main Settings
Prompt Mode
Choose whether to supply a prompt that guides transcription output. Prompts allow you to bias the model toward correct terminology, formatting, or stylistic rules. Options:
- No Prompt — Default behavior with no custom guidance
- Custom Prompt — Enables the Prompt field
Prompt (shown only when Custom Prompt is selected)
A text prompt that shapes how the transcription is produced. Prompts can help with:
- Correcting domain-specific words or acronyms
- Preserving punctuation or enforcing specific writing styles
- Keeping filler words when desired
- Choosing language variants (e.g., simplified vs. traditional Chinese)
- Providing contextual hints for better continuity
Model
Choose which OpenAI speech-to-text model to use. Options:
- GPT-4o Mini Transcribe — Fastest and lowest cost; ideal for real-time streaming
- GPT-4o Transcribe — Higher accuracy, still optimized for real-time use
Language
Select the language for transcription.
- Auto (Detect) automatically identifies the spoken language
- Manual selection includes 50+ supported languages
- Default is English
Noise Reduction Type
Optimizes transcription for the caller’s microphone environment. Options:
- Far Field — Best for microphones farther from the speaker
  - Typical for browser calls or laptop/desktop microphones
- Near Field — Best for close-talk microphones
  - Phone-to-ear, headset boom mics
  - All SIP calls automatically use Near Field
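A hypothetical sketch combining the OpenAI Ears settings (field names and values are illustrative assumptions, and the product names in the prompt are made up):

```typescript
// Hypothetical OpenAI Ears settings; names are illustrative.
interface OpenAIEarsConfig {
  model: "gpt-4o-mini-transcribe" | "gpt-4o-transcribe";
  promptMode: "none" | "custom";
  prompt?: string;             // only used when promptMode = "custom"
  language: string;            // "auto" or one of 50+ languages; defaults to English
  noiseReductionType: "near_field" | "far_field"; // SIP calls always use near field
}

const browserSupport: OpenAIEarsConfig = {
  model: "gpt-4o-mini-transcribe", // fastest, lowest cost for real-time streaming
  promptMode: "custom",
  prompt: "Preserve product names: AcmeCloud, AcmeSync. Keep filler words.", // hypothetical terms
  language: "auto",
  noiseReductionType: "far_field", // laptop/desktop microphone on a browser call
};
```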

