15. Latency Budget and Tuning
The system targets < 2.5 s end-to-end speech-to-speech latency on the reference RTX 3060 12 GB workstation, with < 1.5 s for OBS subtitles.
Budget Breakdown
| Stage | Typical | Driven by |
|---|---|---|
| Mic capture chunk | ≤ 200 ms | `stt.chunk_duration` |
| STT partial | 300–800 ms | `stt.window_seconds`, `stt.beam_size`, model size |
| STT confirmed | + 300–700 ms | `stt.confirm_threshold`, `stt.max_segment_age` |
| LLM translation | 100–250 ms | `translation.max_tokens`, `n_ctx`, prompt length |
| TTS TTFA | 400–1500 ms | Backend, `tts.streaming_mode`, batch / chunk size |
| Audio playback start | ~50 ms | `audio_output.buffer_size`, `crossfade_ms` |
Latencies are sampled live by LatencyTracker and shown in Section A2.
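
To see how the table relates to the two targets, the typical ranges can be summed stage by stage. The sketch below is only illustrative arithmetic: it assumes the stages run strictly one after another, treats the "≤ 200 ms" capture bound as a 0–200 ms range, and assumes that subtitles need every stage up to and including translation. The names `BUDGET_MS`, `SUBTITLE_STAGES` and `total_range` are hypothetical and not part of the project.

```python
# Typical per-stage latency ranges in milliseconds, copied from the table.
# Assumption: the "<= 200 ms" capture bound is modelled as a 0-200 ms range.
BUDGET_MS = {
    "mic_capture": (0, 200),
    "stt_partial": (300, 800),
    "stt_confirmed": (300, 700),
    "llm_translation": (100, 250),
    "tts_ttfa": (400, 1500),
    "playback_start": (50, 50),
}

# Assumption: subtitles need everything up to and including translation.
SUBTITLE_STAGES = ("mic_capture", "stt_partial", "stt_confirmed", "llm_translation")


def total_range(stages):
    """Sum the low and high ends of the typical ranges over the given stages."""
    low = sum(BUDGET_MS[s][0] for s in stages)
    high = sum(BUDGET_MS[s][1] for s in stages)
    return low, high


speech_low, speech_high = total_range(BUDGET_MS)
subs_low, subs_high = total_range(SUBTITLE_STAGES)
print(f"speech-to-speech: {speech_low}-{speech_high} ms (target < 2500 ms)")
print(f"OBS subtitles:    {subs_low}-{subs_high} ms (target < 1500 ms)")
```

Under these assumptions the high ends of both sums exceed their targets, so meeting the budget depends on keeping most stages, TTS TTFA in particular, near the lower end of their typical range.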
Knobs
STT
- `compute_type`: `int8_float16` is the default. Switching to `int8` drops VRAM use and is a touch faster on Ampere; `float16` is slower but marginally more accurate.
- `beam_size`: `1` (greedy) is the default. Higher beam sizes add 100–300 ms for minor quality gains.
- `window_seconds`: shorter windows reduce partial latency but hurt long-segment coherence. 6.5 s is a good trade-off.
- `max_segment_age`: raised from 2.0 s to 4.0 s (commit `53e3cbe`) to avoid mid-sentence force-flushes; lower it again if you prefer twitchier confirmations.
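
The effect of `max_segment_age` can be pictured as a simple age check on the pending segment. This is a hypothetical sketch, not the project's confirmation logic: the function name `should_force_flush` and its parameters are invented for illustration, and the real path presumably also involves `stt.confirm_threshold`, which is not modelled here.

```python
import time
from typing import Optional

MAX_SEGMENT_AGE_S = 4.0  # default named in the text (previously 2.0)


def should_force_flush(segment_started_at: float,
                       now: Optional[float] = None,
                       max_segment_age: float = MAX_SEGMENT_AGE_S) -> bool:
    """Return True once an unconfirmed segment is at least max_segment_age old."""
    if now is None:
        now = time.monotonic()
    return (now - segment_started_at) >= max_segment_age


# A segment that has been open for 3 s is force-flushed with the old 2.0 s
# limit but is left to finish naturally with the 4.0 s default.
print(should_force_flush(0.0, now=3.0, max_segment_age=2.0))
print(should_force_flush(0.0, now=3.0))
```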
Translation
- `n_ctx`: kept at 512 to keep TTFB low. Increase only if your prompts plus context plus glossary need more headroom.
- `max_tokens`: capped at 64 to bound the worst case. Translations longer than this are rare and indicate a too-large source batch.
- `temperature` / `top_k`: held at 0 / 1 for determinism.
- `context_window_size`: 2 pairs is enough for pronoun stability without inflating prompt length.
- Backend: switch to `api` and point at LM Studio if you have a bigger card; the controller switches transparently.
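
Whether `n_ctx` needs raising comes down to token arithmetic: the prompt and the reserved output must fit in the context window together. A minimal sketch follows, assuming you can count tokens for each prompt part with whatever tokenizer your model uses. The function name `prompt_headroom` and the example token counts are hypothetical.

```python
from typing import Sequence


def prompt_headroom(system_tokens: int,
                    glossary_tokens: int,
                    context_pair_tokens: Sequence[int],
                    source_tokens: int,
                    n_ctx: int = 512,
                    max_tokens: int = 64) -> int:
    """Tokens left in the context window after the prompt and reserved output."""
    prompt = system_tokens + glossary_tokens + sum(context_pair_tokens) + source_tokens
    return n_ctx - max_tokens - prompt


# Hypothetical counts: 180-token system prompt, 60-token glossary,
# two context pairs (context_window_size = 2) and a 30-token source sentence.
headroom = prompt_headroom(180, 60, [40, 45], 30)
print(headroom)  # 512 - 64 - 355 = 93
```

A negative result would mean the prompt cannot fit alongside a full-length reply, which is the case where increasing `n_ctx` or trimming the prompt is worth considering.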
TTS
- `streaming_mode`: `3` is the SoVITS chunk-streaming mode. Lower values produce one big chunk (higher TTFA, easier on the player).
- `speed_factor`: a slight speed-up (1.2–1.3) cuts perceived latency because audio finishes sooner relative to the next translation.
- `tts_queue_max_pending`: drop-oldest cap. Kept low (3) so a long speech burst does not back-pressure the queue.
- Two-tier playback: already on by default since Phase 5 (`e9d4556`), with crossfade defaulting to 50 ms.
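
The drop-oldest behaviour behind `tts_queue_max_pending` can be illustrated with a bounded deque. This is a sketch of the general technique, not the project's queue: the class name `DropOldestQueue` is hypothetical, and the example is single-threaded, so a real pipeline would need locking or a thread-safe wrapper.

```python
from collections import deque

TTS_QUEUE_MAX_PENDING = 3  # value given in the text


class DropOldestQueue:
    """Bounded FIFO that discards the oldest pending item when full."""

    def __init__(self, max_pending: int = TTS_QUEUE_MAX_PENDING):
        self._items = deque(maxlen=max_pending)
        self.dropped = 0

    def put(self, item):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the deque evicts the leftmost (oldest) item
        self._items.append(item)

    def get(self):
        """Pop the oldest remaining item, or None when the queue is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


# A burst of five sentences arrives before the synthesizer takes any.
queue = DropOldestQueue()
for sentence in ["s1", "s2", "s3", "s4", "s5"]:
    queue.put(sentence)
print(len(queue), queue.dropped)
```

After the burst only the three newest sentences remain, so playback stays close to what the speaker is saying now instead of working through a backlog.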
AEC
AEC adds negligible latency in the audio path (a single per-chunk NLMS pass), but it removes the need to mute the mic during TTS, which would otherwise add hundreds of milliseconds of silence to every reply. Leave it on unless you are using a hardware-cancelling headset.
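
For readers unfamiliar with NLMS, the per-chunk pass amounts to one adaptive-filter update per sample. The following is a textbook sketch in plain Python, not the project's AEC code: the function name `nlms_cancel`, the tap count, and the step size are assumptions, and a real implementation would carry the filter weights and sample history across chunks and would normally be vectorized.

```python
import random


def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """One NLMS pass over a chunk: subtract the estimated echo of ref from mic."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # residual: mic minus predicted echo
        norm = sum(h * h for h in history) + eps
        step = mu * error / norm  # step size normalized by reference power
        weights = [w + step * h for w, h in zip(weights, history)]
        out.append(error)
    return out, weights


# Synthetic check: the "mic" hears only the reference, delayed by two
# samples and attenuated to 0.6, so the residual should decay toward zero.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.0, 0.0] + [0.6 * x for x in ref[:-2]]
residual, learned = nlms_cancel(mic, ref)
```

The cost is a few multiply-adds per tap per sample, which is consistent with the negligible latency described above.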
Diagnostic Workflow
- Watch Section A2 metrics during a known-length utterance.
- If STT partial is high: lower `window_seconds` or check that `compute_type` is `int8_float16`.
- If translation is high: shorten the system prompt, drop `context_window_size`, or move to the API backend.
- If TTS TTFA is high: confirm warm-up succeeded (logs); check `streaming_mode`; for Qwen3-TTS confirm the synth-latency preset (commit `123b65e`).
- If the audio device under-runs: raise `audio_output.buffer_size`.
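
The checklist above can be turned into a small lookup if you want to script it against exported metrics. This is a hypothetical helper: it assumes "high" means above the upper end of the Typical column in the budget table, and the names `TYPICAL_MAX_MS`, `HINTS` and `diagnose` do not exist in the project.

```python
# Upper ends of the "Typical" column, in milliseconds.
# Assumption: a stage counts as "high" once it exceeds this value.
TYPICAL_MAX_MS = {
    "stt_partial": 800,
    "llm_translation": 250,
    "tts_ttfa": 1500,
}

HINTS = {
    "stt_partial": "lower window_seconds or check compute_type is int8_float16",
    "llm_translation": "shorten the system prompt, drop context_window_size, "
                       "or move to the API backend",
    "tts_ttfa": "confirm warm-up succeeded and check streaming_mode",
}


def diagnose(measured_ms):
    """Return (stage, hint) pairs for stages above their typical upper bound."""
    return [
        (stage, HINTS[stage])
        for stage, limit in TYPICAL_MAX_MS.items()
        if measured_ms.get(stage, 0) > limit
    ]


# Hypothetical readings taken from the metrics panel during a test utterance.
for stage, hint in diagnose({"stt_partial": 950, "llm_translation": 180, "tts_ttfa": 1700}):
    print(f"{stage}: {hint}")
```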