Transcriber nodes convert caller audio into text, which becomes the agent’s “hearing.”
Every agent in a workflow inherits the default transcriber from the Start node; you can change that default by attaching a transcriber node to the Start node, or override it for an individual agent by attaching one directly to that Agent.
Different providers offer different strengths — multilingual support, speed, accuracy, or domain specialization. The settings you configure here directly impact how quickly and accurately your agent understands the caller.
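To make the inheritance concrete, here is a minimal sketch of how it might look if a workflow were written out as configuration. The shape and field names are hypothetical; in the product you configure this through the node UI.

```typescript
// Hypothetical workflow config illustrating transcriber inheritance.
// Field names are illustrative only; the product configures this via the node UI.
const workflow = {
  start: {
    // Default transcriber: inherited by every agent that does not override it.
    transcriber: { provider: "breez", model: "echo" },
  },
  agents: [
    { name: "Triage" }, // inherits Echo from the Start node
    {
      name: "BillingSpecialist",
      // Attaching a transcriber node to an agent overrides the default.
      transcriber: { provider: "deepgram", model: "nova-3-general" },
    },
  ],
};
```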

Breez Ears

Breez Ears is our unified first-party speech-to-text engine, offering Echo (multilingual) and Wave (single-language, enhanced accuracy) models.
This node allows you to control model selection, language configuration, contextual boosts (Echo only), and endpoint/utterance detection behavior.

Main Settings

Model

Choose which Breez transcriber to use. Options:
  • Echo Universal – 60+ Languages
    Multilingual model with automatic language detection.
  • Wave Pro – Enhanced Accuracy
    Single-language, highest accuracy with deeper processing.
  • Wave Lite – Fast & Efficient
    Single-language, optimized for maximum speed and low latency.
What this controls:
This determines the underlying STT model powering all recognition for this agent or workflow.
Echo is ideal for multilingual or unknown-language calls; Wave models are ideal for predictable languages and maximum precision.

Language Hints (Echo only)

Appears only when Echo is selected. Suggest the language(s) you expect callers to use; hints help the model narrow down accent, vocabulary, and acoustic patterns. Leave the field empty to enable full automatic detection across all 60+ supported languages. Options:
A selectable list of 30+ languages (e.g., English, Spanish, Hindi, Arabic, etc.).
When to use hints:
  • Use hints when you know the likely caller language(s).
  • Leave empty for international or uncertain-language workflows.

Language (Wave only)

Appears only when Wave Pro or Wave Lite is selected. Since Wave models support only one language at a time, this field specifies the exact language you expect from the caller. Typical use cases:
  • You’re operating in a single-language region.
  • The workflow is designed for a predictable language (e.g., English support line).
  • You want to maximize Wave’s accuracy and speed by removing ambiguity.

Context (Optional) (Echo only)

Appears only when Echo is selected. Provide additional domain-specific context that improves transcription of names, jargon, product terminology, etc. Examples:
  • “medical terms, product names, technical jargon”
  • “car models, insurance terms, claim numbers”
You can insert workflow variables using @variableName, allowing the model to dynamically adapt based on runtime data.
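Putting the Echo settings together, a configuration might look like the sketch below. The field names are hypothetical (the node UI is the real interface), but it shows how a @variableName placeholder in the Context field could resolve at runtime.

```typescript
// Hypothetical Echo transcriber settings; field names are illustrative.
const echoTranscriber = {
  model: "echo-universal",
  languageHints: ["en", "es"], // empty array => full auto-detect across 60+ languages
  context: "insurance terms, claim numbers, policy holder @customerName",
};

// Sketch of how @variableName placeholders might be interpolated at call time.
function resolveContext(template: string, vars: Record<string, string>): string {
  // Replace each @variableName with its runtime value, leaving unknown names intact.
  return template.replace(/@(\w+)/g, (match, name) => vars[name] ?? match);
}

console.log(resolveContext(echoTranscriber.context, { customerName: "Dana Reyes" }));
// => "insurance terms, claim numbers, policy holder Dana Reyes"
```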

Server EOU Mode (Wave only)

Appears only when Wave Pro or Wave Lite is selected. Controls the end-of-utterance (EOU) strategy used to determine when a transcript should finalize and the agent should respond. In practice:
  • Faster EOU → quicker agent responses but may finalize too early.
  • Slower EOU → more complete transcription but slightly higher latency.
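A comparable Wave configuration is sketched below, again with hypothetical field and mode names; the values simply encode the speed-versus-completeness trade-off described above.

```typescript
// Hypothetical Wave transcriber settings; names are illustrative.
const waveTranscriber = {
  model: "wave-pro",      // or "wave-lite" for the lowest latency
  language: "en",         // Wave handles exactly one language per call
  serverEouMode: "balanced" as "fast" | "balanced" | "thorough",
  // "fast"     => quicker agent responses, risk of finalizing mid-sentence
  // "thorough" => fuller transcripts, slightly higher latency
};
```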

Endpoint Detection Mode (Echo only)

Controls how Breez decides when the caller has finished speaking. Options:
  • Advanced
    Uses enhanced Breez endpoint detection for more accurate turn-taking.
    Best for natural conversations where interruptions, overlaps, and short pauses are common.
  • Standard
    Uses Breez’s semantic/VAD fallback system, providing broader language steering and compatibility.
What this setting affects:
  • How quickly the agent begins responding
  • Whether brief pauses are interpreted as the user “being done”
  • Overall smoothness of barge-in and turn detection

Deepgram Ears

Deepgram Ears integrates Deepgram’s real-time Speech-to-Text engine into your workflow.
Depending on the selected model, different settings become available — Flux models expose turn-taking and latency controls, while Nova models expose formatting and transcription options.

Main Settings

Model

Select which Deepgram model to use. Options:
  • Flux General (English Only)
    Real-time model optimized for speed and responsive turn-taking. English only.
  • Nova 3 General
    High-accuracy multilingual model.
  • Nova 3 Medical (English Only)
    Medical-tuned model for clinical terminology.

Opt in to Deepgram MIP (50% discount)

Enable Deepgram’s Model Improvement Program to receive ~50% lower pricing.
Your audio may be used by Deepgram to improve future models (per their policy).

Flux Model Settings

Shown when Flux General is selected. Flux exposes Deepgram’s real-time turn-taking controls, allowing you to tune how aggressively or conservatively the system decides a user has finished speaking.

End of Turn Threshold

Controls how confident the model must be that the speaker has finished talking (range 0.5–0.9).
  • Lower values (~0.5–0.6): Faster responses, but increased risk of cutting the user off.
  • Higher values (~0.7–0.9): Safer, more reliable detection, but slightly slower responses.
Recommended: 0.6–0.7

End of Turn Timeout

Maximum allowed silence before forcing a turn end (1–10 seconds). Flux uses semantic detection first; this timeout triggers only if semantic detection is uncertain. Recommended: ~3 seconds, a good safety net without creating long pauses.

Eager Response Mode

Controls whether the agent begins preparing its response before Deepgram is fully certain the user has finished speaking. Options:
  • Off – Most conservative. Waits for high confidence.
  • Conservative – Starts preparing slightly earlier, still cautious.
  • Balanced – Good mix of responsiveness and safety.
  • Aggressive – Fastest responses; may occasionally interrupt if the user pauses mid-sentence.
Use this to tune the speed/safety trade-off based on your use case.
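The three Flux settings above (threshold, timeout, and eager mode) combine into a single turn-taking policy. The sketch below illustrates the decision logic they imply; the field names are hypothetical, though Deepgram’s own Flux API exposes comparable end-of-turn parameters (check their docs for exact names).

```typescript
type EagerMode = "off" | "conservative" | "balanced" | "aggressive";

// Hypothetical Flux turn-taking settings; names are illustrative.
const fluxSettings = {
  endOfTurnThreshold: 0.7,  // 0.5–0.9: higher = safer but slower turn ends
  endOfTurnTimeoutSec: 3,   // 1–10 s: fires only when semantic detection is uncertain
  eagerResponseMode: "balanced" as EagerMode,
};

// The logic the two numeric settings imply: end the turn when the model is
// confident enough, or when silence outlasts the timeout safety net.
function shouldEndTurn(eotConfidence: number, silenceSec: number): boolean {
  return (
    eotConfidence >= fluxSettings.endOfTurnThreshold ||
    silenceSec >= fluxSettings.endOfTurnTimeoutSec
  );
}
```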

Nova Model Settings

Shown when Nova 3 General or Nova 3 Medical is selected. Nova models focus on transcription quality and formatting rather than turn-detection controls.

Language

  • Nova 3 General: Choose Auto (multilingual) or any of 35+ supported languages.
  • Nova 3 Medical: Only English is available.

Punctuation

Automatically inserts punctuation marks into the transcript.

Filler Words

Includes utterances like “uh”, “um”, and similar fillers.
Disable if you want cleaner transcripts; enable if fillers are important for intent or analysis.

Smart Format

Automatically formats structured content such as dates, times, addresses, or phone numbers.

Profanity Filter

Masks or filters out offensive language.

Numerals

Converts written-out numbers into numeric digits.
Example: “one hundred twenty three” → “123”
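These toggles correspond to formatting options in Deepgram’s streaming API. Below is a sketch of a live-transcription URL under that assumption; the query parameter names follow Deepgram’s public documentation, but verify them against the current reference.

```typescript
// Sketch: a Deepgram live-transcription URL carrying the Nova options above.
// Parameter names follow Deepgram's documented query parameters.
const params = new URLSearchParams({
  model: "nova-3",
  language: "en",           // or a specific code / multilingual mode, per the Language setting
  punctuate: "true",        // Punctuation
  filler_words: "true",     // Filler Words ("uh", "um", ...)
  smart_format: "true",     // Smart Format (dates, times, phone numbers)
  profanity_filter: "true", // Profanity Filter
  numerals: "true",         // Numerals ("one hundred twenty three" -> "123")
});

const url = `wss://api.deepgram.com/v1/listen?${params.toString()}`;
```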

ElevenLabs Ears

ElevenLabs Ears uses ElevenLabs’ Scribe realtime speech-to-text engine, offering fast and accurate multilingual transcription. This node allows you to configure the model, language behavior, and optional audio-event tagging.

Main Settings

Model

Select the ElevenLabs Scribe model. Options:
  • Scribe V2 Realtime (Latest) — the latest realtime model with improved accuracy and latency.
(This is currently the only available model option.)

Language

Choose a specific language or let ElevenLabs automatically detect it. Options:
  • Auto (detect language) — ElevenLabs will automatically identify the spoken language.
  • Specific language — e.g., English (US), Spanish, French, German, etc. (full list available in the dropdown).
  • Custom (enter code) — manually provide a language code for advanced or unlisted languages.
  • Not Selected (legacy) — only for compatibility with older configurations.
Guidance:
Use Auto for global workflows or uncertain language scenarios; choose a specific language when accuracy is the priority; use Custom for languages not explicitly listed.

Custom Language Code

Only visible when Language = Custom. Enter an ISO 639-1 or BCP-47 language code to explicitly control transcription behavior. Examples:
  • en — English
  • es-MX — Spanish (Mexico)
  • pt-BR — Portuguese (Brazil)
You can also insert runtime variables using @variableName if the language should be determined dynamically during the workflow.

Tag Audio Events

Whether to include markers for non-speech sounds in the transcript. Examples of audio events:
  • (laughter)
  • (cough)
  • (music)
On: Includes contextual tags in the transcription.
Off: Produces clean text only, without annotations.
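Taken together, the node’s settings reduce to a handful of fields. The sketch below uses hypothetical names (ElevenLabs’ own speech-to-text API uses identifiers such as language_code and tag_audio_events) and shows a runtime variable standing in for the custom code.

```typescript
// Hypothetical ElevenLabs Ears settings; field names are illustrative.
const elevenLabsTranscriber = {
  model: "scribe-v2-realtime",
  // Language resolution, mirroring the options above:
  //   "auto"   => let Scribe detect the spoken language
  //   a code   => force a listed language, e.g. "en" or "es"
  //   "custom" => use customLanguageCode below
  language: "custom",
  customLanguageCode: "@callerLanguage", // runtime variable; might resolve to "pt-BR"
  tagAudioEvents: true,                  // include markers such as (laughter) or (cough)
};
```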

OpenAI Ears

OpenAI Ears provides streaming speech-to-text powered by the GPT-4o family of transcription models.
It supports optional prompting, language control, and microphone-optimized noise reduction.

Main Settings

Prompt Mode

Choose whether to supply a prompt that guides transcription output. Prompts allow you to bias the model toward correct terminology, formatting, or stylistic rules. Options:
  • No Prompt — Default behavior with no custom guidance
  • Custom Prompt — Enables the Prompt field

Prompt (shown only when Custom Prompt is selected)

A text prompt that shapes how the transcription is produced. Prompts can help with:
  • Correcting domain-specific words or acronyms
  • Preserving punctuation or enforcing specific writing styles
  • Keeping filler words when desired
  • Choosing language variants (e.g., simplified vs. traditional Chinese)
  • Providing contextual hints for better continuity
Note: Custom prompts are only supported with GPT-4o models (not Mini).
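For example, a custom prompt for a telecom support line might look like this (illustrative only):

```typescript
// Illustrative custom prompt for a telecom support line.
const transcriptionPrompt =
  "Expect telecom terms such as eSIM, APN, and VoLTE. " +
  "Transcribe account numbers as digits. Keep filler words.";
```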

Model

Choose which OpenAI speech-to-text model to use. Options:
  • GPT-4o Mini Transcribe — Fastest and lowest cost; ideal for real-time streaming
  • GPT-4o Transcribe — Higher accuracy, still optimized for real-time use

Language

Select the language for transcription.
  • Auto (Detect) automatically identifies the spoken language
  • Manual selection includes 50+ supported languages
  • Default is English
Selecting the language improves accuracy when the expected language is known.

Noise Reduction Type

Optimizes transcription for the caller’s microphone environment. Options:
  • Far Field — Best for microphones farther from the speaker
    • Typical for browser calls or laptop/desktop microphones
  • Near Field — Best for close-talk microphones
    • Phone-to-ear, headset boom mics
    • All SIP calls automatically use Near Field
Choosing the correct mode improves clarity and reduces background noise.
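Putting the OpenAI settings together: the sketch below mirrors the shape of a realtime transcription session update as OpenAI’s Realtime API documents it; treat the exact field names as an assumption and confirm against the current reference.

```typescript
// Sketch of an OpenAI realtime transcription session reflecting the settings above.
// Field names follow OpenAI's Realtime API as documented; verify before relying on them.
const sessionUpdate = {
  type: "transcription_session.update",
  input_audio_transcription: {
    model: "gpt-4o-transcribe", // or "gpt-4o-mini-transcribe" (no custom prompt support)
    prompt: "Expect telecom terms such as eSIM, APN, and VoLTE. Keep filler words.",
    language: "en",             // omit to let the model detect the language
  },
  input_audio_noise_reduction: {
    type: "near_field",         // "far_field" for laptop/desktop mics; SIP calls always use near_field
  },
};
```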