Transcriber nodes convert caller audio into text, which becomes the agent’s “hearing.”
Every agent in a workflow inherits the default transcriber from the Start node; you can change that default by attaching a transcriber node to the Start node, or override it for an individual agent by attaching one directly to that Agent.
Different providers offer different strengths — multilingual support, speed, accuracy, or domain specialization. The settings you configure here directly impact how quickly and accurately your agent understands the caller.
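To make the inheritance concrete, here is a minimal sketch of how it might look if a workflow were written out as configuration. The shape and field names are hypothetical; in the product you configure this through the node UI.

```typescript
// Hypothetical workflow config illustrating transcriber inheritance.
// Field names are illustrative only; the product configures this via the node UI.
const workflow = {
  start: {
    // Default transcriber: inherited by every agent that does not override it.
    transcriber: { provider: "breez", model: "echo" },
  },
  agents: [
    { name: "Triage" }, // inherits Echo from the Start node
    {
      name: "BillingSpecialist",
      // Attaching a transcriber node to an agent overrides the default.
      transcriber: { provider: "deepgram", model: "nova-3-general" },
    },
  ],
};
```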

Breez Ears

Breez Ears is our unified first-party speech-to-text engine, offering Echo (multilingual) and Wave (single-language, enhanced accuracy) models.
This node allows you to control model selection, language configuration, contextual boosts (Echo only), and endpoint/utterance detection behavior.

Main Settings

Model

Choose which Breez transcriber to use. Options:
  • Echo Universal – 60+ Languages
    Multilingual model with automatic language detection.
  • Wave Pro – Enhanced Accuracy
    Single-language, highest accuracy with deeper processing.
  • Wave Lite – Fast & Efficient
    Single-language, optimized for maximum speed and low latency.
What this controls:
This determines the underlying STT model powering all recognition for this agent or workflow.
Echo is ideal for multilingual or unknown-language calls; Wave models are ideal for predictable languages and maximum precision.

Language Hints (Echo only)

Appears only when Echo is selected. Suggest the language(s) you expect callers to use; hints help the model narrow down accent, vocabulary, and acoustic patterns. Leave the field empty to enable full automatic detection across all 60+ supported languages. Options:
A selectable list of 30+ languages (e.g., English, Spanish, Hindi, Arabic, etc.).
When to use hints:
  • Use hints when you know the likely caller language(s).
  • Leave empty for international or uncertain-language workflows.

Language (Wave only)

Appears only when Wave Pro or Wave Lite is selected. Since Wave models support only one language at a time, this field specifies the exact language you expect from the caller. Typical use cases:
  • You’re operating in a single-language region.
  • The workflow is designed for a predictable language (e.g., English support line).
  • You want to maximize Wave’s accuracy and speed by removing ambiguity.

Context (Optional) (Echo only)

Appears only when Echo is selected. Provide additional domain-specific context that improves transcription of names, jargon, product terminology, etc. Examples:
  • “medical terms, product names, technical jargon”
  • “car models, insurance terms, claim numbers”
You can insert workflow variables using @variableName, allowing the model to dynamically adapt based on runtime data.
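Putting the Echo settings together, a configuration might look like the sketch below. The field names are hypothetical (the node UI is the real interface), but it shows how a @variableName placeholder in the Context field could resolve at runtime.

```typescript
// Hypothetical Echo transcriber settings; field names are illustrative.
const echoTranscriber = {
  model: "echo-universal",
  languageHints: ["en", "es"], // empty array => full auto-detect across 60+ languages
  context: "insurance terms, claim numbers, policy holder @customerName",
};

// Sketch of how @variableName placeholders might be interpolated at call time.
function resolveContext(template: string, vars: Record<string, string>): string {
  // Replace each @variableName with its runtime value, leaving unknown names intact.
  return template.replace(/@(\w+)/g, (match, name) => vars[name] ?? match);
}

console.log(resolveContext(echoTranscriber.context, { customerName: "Dana Reyes" }));
// => "insurance terms, claim numbers, policy holder Dana Reyes"
```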

Server EOU Mode (Wave only)

Appears only when Wave Pro or Wave Lite is selected. Controls the end-of-utterance (EOU) strategy used to determine when a transcript should finalize and the agent should respond. In practice:
  • Faster EOU → quicker agent responses but may finalize too early.
  • Slower EOU → more complete transcription but slightly higher latency.
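A comparable Wave configuration is sketched below, again with hypothetical field and mode names; the values simply encode the speed-versus-completeness trade-off described above.

```typescript
// Hypothetical Wave transcriber settings; names are illustrative.
const waveTranscriber = {
  model: "wave-pro",      // or "wave-lite" for the lowest latency
  language: "en",         // Wave handles exactly one language per call
  serverEouMode: "balanced" as "fast" | "balanced" | "thorough",
  // "fast"     => quicker agent responses, risk of finalizing mid-sentence
  // "thorough" => fuller transcripts, slightly higher latency
};
```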

Endpoint Detection Mode (Echo only)

Controls how Breez decides when the caller has finished speaking. Options:
  • Advanced
    Uses enhanced Breez endpoint detection for more accurate turn-taking.
    Best for natural conversations where interruptions, overlaps, and short pauses are common.
  • Standard
    Uses Breez’s semantic/VAD fallback system, providing broader language steering and compatibility.
What this setting affects:
  • How quickly the agent begins responding
  • Whether brief pauses are interpreted as the user “being done”
  • Overall smoothness of barge-in and turn detection

Deepgram Ears

Deepgram Ears integrates Deepgram’s real-time Speech-to-Text engine into your workflow.
Depending on the selected model, different settings become available — Flux models expose turn-taking and latency controls, while Nova models expose formatting and transcription options.

Main Settings

Model

Select which Deepgram model to use. Options:
  • Flux General (English Only)
    Real-time model optimized for speed and responsive turn-taking. English only.
  • Nova 3 General
    High-accuracy multilingual model.
  • Nova 3 Medical (English Only)
    Medical-tuned model for clinical terminology.

Opt in to Deepgram MIP (50% discount)

Enable Deepgram’s Model Improvement Program to receive ~50% lower pricing.
Your audio may be used by Deepgram to improve future models (per their policy).

Flux Model Settings

Shown when Flux General is selected. Flux exposes Deepgram’s real-time turn-taking controls, allowing you to tune how aggressively or conservatively the system decides a user has finished speaking.

End of Turn Threshold

Controls how confident the model must be that the speaker has finished talking (range 0.5–0.9).
  • Lower values (~0.5–0.6): Faster responses, but increased risk of cutting the user off.
  • Higher values (~0.7–0.9): Safer, more reliable detection, but slightly slower responses.
Recommended: 0.6–0.7

End of Turn Timeout

Maximum allowed silence before forcing a turn end (1–10 seconds). Flux uses semantic detection first; this timeout triggers only if semantic detection is uncertain. Recommended: ~3 seconds, a good safety net without creating long pauses.

Eager Response Mode

Controls whether the agent begins preparing its response before Deepgram is fully certain the user has finished speaking. Options:
  • Off – Most conservative. Waits for high confidence.
  • Conservative – Starts preparing slightly earlier, still cautious.
  • Balanced – Good mix of responsiveness and safety.
  • Aggressive – Fastest responses; may occasionally interrupt if the user pauses mid-sentence.
Use this to tune the speed/safety trade-off based on your use case.
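The three Flux settings above (threshold, timeout, and eager mode) combine into a single turn-taking policy. The sketch below illustrates the decision logic they imply; the field names are hypothetical, though Deepgram’s own Flux API exposes comparable end-of-turn parameters (check their docs for exact names).

```typescript
type EagerMode = "off" | "conservative" | "balanced" | "aggressive";

// Hypothetical Flux turn-taking settings; names are illustrative.
const fluxSettings = {
  endOfTurnThreshold: 0.7,  // 0.5–0.9: higher = safer but slower turn ends
  endOfTurnTimeoutSec: 3,   // 1–10 s: fires only when semantic detection is uncertain
  eagerResponseMode: "balanced" as EagerMode,
};

// The logic the two numeric settings imply: end the turn when the model is
// confident enough, or when silence outlasts the timeout safety net.
function shouldEndTurn(eotConfidence: number, silenceSec: number): boolean {
  return (
    eotConfidence >= fluxSettings.endOfTurnThreshold ||
    silenceSec >= fluxSettings.endOfTurnTimeoutSec
  );
}
```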

Nova Model Settings

Shown when Nova 3 General or Nova 3 Medical is selected. Nova models focus on transcription quality and formatting rather than turn-detection controls.

Language

  • Nova 3 General: Choose Auto (multilingual) or any of 35+ supported languages.
  • Nova 3 Medical: Only English is available.

Punctuation

Automatically inserts punctuation marks into the transcript.

Filler Words

Includes utterances like “uh”, “um”, and similar fillers.
Disable if you want cleaner transcripts; enable if fillers are important for intent or analysis.

Smart Format

Automatically formats structured content such as dates, times, addresses, or phone numbers.

Profanity Filter

Masks or filters out offensive language.

Numerals

Converts written-out numbers into numeric digits.
Example: “one hundred twenty three” → “123”
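These toggles correspond to formatting options in Deepgram’s streaming API. Below is a sketch of a live-transcription URL under that assumption; the query parameter names follow Deepgram’s public documentation, but verify them against the current reference.

```typescript
// Sketch: a Deepgram live-transcription URL carrying the Nova options above.
// Parameter names follow Deepgram's documented query parameters.
const params = new URLSearchParams({
  model: "nova-3",
  language: "en",           // or a specific code / multilingual mode, per the Language setting
  punctuate: "true",        // Punctuation
  filler_words: "true",     // Filler Words ("uh", "um", ...)
  smart_format: "true",     // Smart Format (dates, times, phone numbers)
  profanity_filter: "true", // Profanity Filter
  numerals: "true",         // Numerals ("one hundred twenty three" -> "123")
});

const url = `wss://api.deepgram.com/v1/listen?${params.toString()}`;
```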

ElevenLabs Ears

ElevenLabs Ears uses ElevenLabs’ Scribe realtime speech-to-text engine, offering fast and accurate multilingual transcription. This node allows you to configure the model, language behavior, and optional audio-event tagging.

Main Settings

Model

Select the ElevenLabs Scribe model. Options:
  • Scribe V2 Realtime (Latest) — the latest realtime model with improved accuracy and latency.
(This is currently the only available model option.)

Language

Choose a specific language or let ElevenLabs automatically detect it. Options:
  • Auto (detect language) — ElevenLabs will automatically identify the spoken language.
  • Specific language — e.g., English (US), Spanish, French, German, etc. (full list available in the dropdown).
  • Custom (enter code) — manually provide a language code for advanced or unlisted languages.
  • Not Selected (legacy) — only for compatibility with older configurations.
Guidance:
Use Auto for global workflows or uncertain language scenarios; choose a specific language when accuracy is the priority; use Custom for languages not explicitly listed.

Custom Language Code

Only visible when Language = Custom. Enter an ISO 639-1 or BCP-47 language code to explicitly control transcription behavior. Examples:
  • en — English
  • es-MX — Spanish (Mexico)
  • pt-BR — Portuguese (Brazil)
You can also insert runtime variables using @variableName if the language should be determined dynamically during the workflow.

Tag Audio Events

Whether to include markers for non-speech sounds in the transcript. Examples of audio events:
  • (laughter)
  • (cough)
  • (music)
On: Includes contextual tags in the transcription.
Off: Produces clean text only, without annotations.
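Taken together, the node’s settings reduce to a handful of fields. The sketch below uses hypothetical names (ElevenLabs’ own speech-to-text API uses identifiers such as language_code and tag_audio_events) and shows a runtime variable standing in for the custom code.

```typescript
// Hypothetical ElevenLabs Ears settings; field names are illustrative.
const elevenLabsTranscriber = {
  model: "scribe-v2-realtime",
  // Language resolution, mirroring the options above:
  //   "auto"   => let Scribe detect the spoken language
  //   a code   => force a listed language, e.g. "en" or "es"
  //   "custom" => use customLanguageCode below
  language: "custom",
  customLanguageCode: "@callerLanguage", // runtime variable; might resolve to "pt-BR"
  tagAudioEvents: true,                  // include markers such as (laughter) or (cough)
};
```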

OpenAI Ears

OpenAI Ears provides streaming speech-to-text powered by the GPT-4o family of transcription models.
It supports optional prompting, language control, and microphone-optimized noise reduction.

Main Settings

Prompt Mode

Choose whether to supply a prompt that guides transcription output. Prompts allow you to bias the model toward correct terminology, formatting, or stylistic rules. Options:
  • No Prompt — Default behavior with no custom guidance
  • Custom Prompt — Enables the Prompt field

Prompt (shown only when Custom Prompt is selected)

A text prompt that shapes how the transcription is produced. Prompts can help with:
  • Correcting domain-specific words or acronyms
  • Preserving punctuation or enforcing specific writing styles
  • Keeping filler words when desired
  • Choosing language variants (e.g., simplified vs. traditional Chinese)
  • Providing contextual hints for better continuity
Note: Custom prompts are only supported with GPT-4o models (not Mini).
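For example, a custom prompt for a telecom support line might look like this (illustrative only):

```typescript
// Illustrative custom prompt for a telecom support line.
const transcriptionPrompt =
  "Expect telecom terms such as eSIM, APN, and VoLTE. " +
  "Transcribe account numbers as digits. Keep filler words.";
```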

Model

Choose which OpenAI speech-to-text model to use. Options:
  • GPT-4o Mini Transcribe — Fastest and lowest cost; ideal for real-time streaming
  • GPT-4o Transcribe — Higher accuracy, still optimized for real-time use

Language

Select the language for transcription.
  • Auto (Detect) automatically identifies the spoken language
  • Manual selection includes 50+ supported languages
  • Default is English
Selecting the language improves accuracy when the expected language is known.

Noise Reduction Type

Optimizes transcription for the caller’s microphone environment. Options:
  • Far Field — Best for microphones farther from the speaker
    • Typical for browser calls or laptop/desktop microphones
  • Near Field — Best for close-talk microphones
    • Phone-to-ear, headset boom mics
    • All SIP calls automatically use Near Field
Choosing the correct mode improves clarity and reduces background noise.
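Putting the OpenAI settings together: the sketch below mirrors the shape of a realtime transcription session update as OpenAI’s Realtime API documents it; treat the exact field names as an assumption and confirm against the current reference.

```typescript
// Sketch of an OpenAI realtime transcription session reflecting the settings above.
// Field names follow OpenAI's Realtime API as documented; verify before relying on them.
const sessionUpdate = {
  type: "transcription_session.update",
  input_audio_transcription: {
    model: "gpt-4o-transcribe", // or "gpt-4o-mini-transcribe" (no custom prompt support)
    prompt: "Expect telecom terms such as eSIM, APN, and VoLTE. Keep filler words.",
    language: "en",             // omit to let the model detect the language
  },
  input_audio_noise_reduction: {
    type: "near_field",         // "far_field" for laptop/desktop mics; SIP calls always use near_field
  },
};
```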