15 — Latency Budget and Tuning

The system targets < 2.5 s end-to-end speech-to-speech latency on the reference RTX 3060 12 GB workstation, with < 1.5 s for OBS subtitles.

Budget Breakdown

Stage	Typical	Driven by
Mic capture chunk	≤ 200 ms	`stt.chunk_duration`
STT partial	300–800 ms	`stt.window_seconds`, `stt.beam_size`, model size
STT confirmed	+ 300–700 ms	`stt.confirm_threshold`, `stt.max_segment_age`
LLM translation	100–250 ms	`translation.max_tokens`, `n_ctx`, prompt length
TTS TTFA	400–1500 ms	Backend, `tts.streaming_mode`, batch / chunk size
Audio playback start	~50 ms	`audio_output.buffer_size`, `crossfade_ms`
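
As a rough sanity check, the ranges in the table can be summed against the targets. The sketch below assumes the stages run strictly one after another (streaming overlap would make the real total shorter), treats the "≤ 200 ms" capture chunk as a fixed 200 ms, and assumes the subtitle path ends after translation. The names `STAGES_MS`, `SUBTITLE_STAGES`, and `budget_range` are illustrative and are not part of the project.

```python
# Typical per-stage latency ranges in milliseconds, copied from the table above.
# Assumption: "≤ 200 ms" for mic capture is treated as a fixed 200 ms.
STAGES_MS = {
    "mic_capture": (200, 200),
    "stt_partial": (300, 800),
    "stt_confirmed": (300, 700),    # added on top of the partial
    "llm_translation": (100, 250),
    "tts_ttfa": (400, 1500),
    "playback_start": (50, 50),
}

# Assumption: the subtitle path stops after translation (no TTS, no playback).
SUBTITLE_STAGES = ("mic_capture", "stt_partial", "stt_confirmed", "llm_translation")


def budget_range(stages=None):
    """Return (best, worst) total in ms, assuming stages run strictly in sequence."""
    names = list(stages) if stages is not None else list(STAGES_MS)
    best = sum(STAGES_MS[name][0] for name in names)
    worst = sum(STAGES_MS[name][1] for name in names)
    return best, worst


if __name__ == "__main__":
    print("speech-to-speech (best, worst) ms:", budget_range())
    print("subtitles (best, worst) ms:", budget_range(SUBTITLE_STAGES))
```

Under these assumptions the low end of every range sums to 1350 ms, well inside the 2.5 s target, while the high end sums to 3500 ms. TTS TTFA has the widest range in the table, so it is the stage where tuning has the most room to help.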

Latencies are sampled live by LatencyTracker and shown in Section A2.
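
For readers who want to reproduce these measurements outside the UI, here is a minimal sketch of a rolling per-stage tracker. It is not the project's LatencyTracker, whose interface is not described here; the class name `StageLatencyTracker`, its methods, and the window size are all hypothetical.

```python
import time
from collections import defaultdict, deque
from contextlib import contextmanager
from statistics import median


class StageLatencyTracker:
    """Hypothetical rolling tracker; not the project's LatencyTracker."""

    def __init__(self, window=50):
        # Keep only the most recent `window` samples per stage.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, stage, seconds):
        """Store one sample, converted to milliseconds."""
        self._samples[stage].append(seconds * 1000.0)

    @contextmanager
    def measure(self, stage):
        """Time the enclosed block and record it under `stage`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(stage, time.perf_counter() - start)

    def summary(self):
        """Return {stage: (median_ms, max_ms)} over the rolling window."""
        return {
            stage: (median(values), max(values))
            for stage, values in self._samples.items()
            if values
        }
```

A stage would be wrapped as `with tracker.measure("stt_partial"): ...`, and `tracker.summary()` read once per utterance. The median is reported alongside the maximum because a single slow outlier would otherwise dominate a short window.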

Knobs

STT

compute_type — int8_float16 is the default. Switching to int8 lowers VRAM use and is a touch faster on Ampere; float16 is slower but marginally more accurate.
beam_size — 1 (greedy) is the default. Higher beams add 100–300 ms with minor quality gains.
window_seconds — shorter windows reduce partial latency but hurt long-segment coherence. 6.5 s is a good trade.
max_segment_age — was raised from 2.0 s to 4.0 s (commit 53e3cbe) to avoid mid-sentence force-flushes; lower it again if you prefer twitchier confirmations.

Translation

n_ctx — kept at 512 to keep TTFB low. Increase only if your prompts plus context plus glossary need more headroom.
max_tokens — capped at 64 to bound the worst case. Translations longer than this are rare and indicate a too-large source batch.
temperature / top_k — held at 0 / 1 for determinism.
context_window_size — 2 pairs is enough for pronoun stability without inflating prompt length.
Backend — set it to api and point it at LM Studio if you have a bigger card; the controller handles the change transparently.
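
The interaction between n_ctx and max_tokens can be checked with simple arithmetic. The sketch below assumes, as is usual for llama.cpp-style runtimes, that the prompt and the generated tokens share the same n_ctx window. The function name `fits_context` and its arguments are hypothetical, and the token counts are expected to come from the model's own tokenizer.

```python
def fits_context(system_tokens, context_pair_tokens, glossary_tokens,
                 source_tokens, n_ctx=512, max_tokens=64):
    """Return (fits, headroom) for one translation request.

    Assumption: prompt tokens and generated tokens share the n_ctx window,
    so max_tokens is reserved up front for the reply.
    """
    prompt_tokens = (
        system_tokens
        + sum(context_pair_tokens)   # one entry per previous source/target pair
        + glossary_tokens
        + source_tokens
    )
    headroom = n_ctx - max_tokens - prompt_tokens
    return headroom >= 0, headroom


if __name__ == "__main__":
    # Illustrative counts only: a 120-token system prompt, two context pairs
    # of 40 tokens each, a 60-token glossary and a 30-token source segment.
    print(fits_context(120, [40, 40], 60, 30))
```

With the illustrative counts in the usage block, 290 prompt tokens plus the 64 reserved for output leave 158 tokens of headroom. A negative headroom is the signal to raise n_ctx or trim the prompt.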

TTS

streaming_mode — 3 is the SoVITS chunk-streaming mode. Lower values produce one big chunk (higher TTFA, easier on the player).
speed_factor — slight speed-up (1.2–1.3) cuts perceived latency because audio finishes sooner relative to the next translation.
tts_queue_max_pending — drop-oldest cap. Kept low (3) so a long speech burst drops stale segments instead of building a backlog that delays newer ones.
Two-tier playback — already on by default since Phase 5 (e9d4556), with crossfade defaulting to 50 ms.
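
The drop-oldest behaviour behind tts_queue_max_pending can be illustrated with a bounded deque. This is a single-threaded sketch of the policy only; the class name `DropOldestQueue` is hypothetical, and the project's real queue may differ (for example by adding locking).

```python
from collections import deque


class DropOldestQueue:
    """Bounded FIFO that discards the oldest pending item when full."""

    def __init__(self, max_pending=3):
        self._items = deque(maxlen=max_pending)
        self.dropped = 0  # how many stale items were evicted so far

    def put(self, item):
        if len(self._items) == self._items.maxlen:
            # deque(maxlen=...) evicts from the left when appending on the right.
            self.dropped += 1
        self._items.append(item)

    def get(self):
        """Return the oldest surviving item, or None if the queue is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    queue = DropOldestQueue(max_pending=3)
    for index in range(5):
        queue.put(f"segment-{index}")
    print(len(queue), queue.dropped, queue.get())
```

The producer never blocks: when a burst outruns synthesis, the oldest segments are lost so that playback stays close to what the speaker is saying now.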

AEC

AEC adds negligible latency in the audio path (a single per-chunk NLMS pass), but it removes the need to mute the mic during TTS, which would otherwise add hundreds of milliseconds of silence to every reply. Leave it on unless you are using a hardware-cancelling headset.
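
To show why a per-chunk NLMS pass is cheap, here is a minimal pure-Python sketch of the textbook normalised LMS update applied to one chunk. It is not the project's AEC code: the function name `nlms_cancel`, the tap count, and the step size are assumptions, and a real implementation would keep the filter weights between chunks and would normally be vectorised.

```python
import random


def nlms_cancel(mic, reference, taps=8, mu=0.5, eps=1e-8):
    """Subtract the estimated echo of `reference` from `mic`, sample by sample."""
    weights = [0.0] * taps
    history = [0.0] * taps            # recent reference samples, newest first
    cleaned = []
    for d, x in zip(mic, reference):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate     # residual: near-end speech + leftover echo
        power = sum(h * h for h in history) + eps
        step = mu * error / power     # step size normalised by reference power
        weights = [w + step * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


if __name__ == "__main__":
    random.seed(0)
    reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
    # Synthetic echo path: the mic hears the reference attenuated and delayed.
    mic = [0.6 * reference[n] + 0.3 * (reference[n - 2] if n >= 2 else 0.0)
           for n in range(len(reference))]
    residual = nlms_cancel(mic, reference)
    print("echo energy, first 200 samples:", sum(v * v for v in mic[:200]))
    print("residual energy, last 200 samples:", sum(v * v for v in residual[-200:]))
```

Each sample costs a handful of multiply-adds per tap, with no look-ahead, which is consistent with the claim that the filter adds negligible latency to the audio path.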

Diagnostic Workflow

Watch Section A2 metrics during a known-length utterance.
If STT partial is high: lower window_seconds or check that compute_type is int8_float16.
If translation is high: shorten the system prompt, drop context_window_size, or move to the API backend.
If TTS TTFA is high: confirm in the logs that warm-up succeeded; check streaming_mode; for Qwen3-TTS, confirm the synth-latency preset (commit 123b65e).
If the audio device under-runs: raise audio_output.buffer_size.
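
The workflow above can be sketched as a small lookup that turns measured latencies into hints. The thresholds are an assumption: they reuse the upper ends of the "Typical" column as the point where a stage counts as high. The names `THRESHOLDS_MS`, `ADVICE`, and `diagnose` are hypothetical.

```python
# Assumption: a stage counts as "high" once it exceeds the top of its typical range.
THRESHOLDS_MS = {
    "stt_partial": 800,
    "llm_translation": 250,
    "tts_ttfa": 1500,
}

# Hints paraphrased from the diagnostic workflow.
ADVICE = {
    "stt_partial": "lower window_seconds or check that compute_type is int8_float16",
    "llm_translation": "shorten the system prompt, drop context_window_size, "
                       "or move to the API backend",
    "tts_ttfa": "confirm warm-up succeeded, check streaming_mode, and for "
                "Qwen3-TTS confirm the synth-latency preset",
}


def diagnose(metrics_ms, underruns=0):
    """Map measured stage latencies (ms) to tuning hints."""
    hints = [
        f"{stage}: {ADVICE[stage]}"
        for stage, limit in THRESHOLDS_MS.items()
        if metrics_ms.get(stage, 0) > limit
    ]
    if underruns > 0:
        hints.append("audio_output: raise audio_output.buffer_size")
    return hints


if __name__ == "__main__":
    sample = {"stt_partial": 950, "llm_translation": 180, "tts_ttfa": 1200}
    for hint in diagnose(sample):
        print(hint)
```

Stages missing from the input are treated as healthy, so the function can be fed partial metrics while only one part of the pipeline is being tuned.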