15 — Latency Budget and Tuning

The system targets < 2.5 s end-to-end speech-to-speech latency on the reference RTX 3060 12 GB workstation, with < 1.5 s for OBS subtitles.

Budget Breakdown

| Stage | Typical | Driven by |
| --- | --- | --- |
| Mic capture chunk | ≤ 200 ms | stt.chunk_duration |
| STT partial | 300–800 ms | stt.window_seconds, stt.beam_size, model size |
| STT confirmed | + 300–700 ms | stt.confirm_threshold, stt.max_segment_age |
| LLM translation | 100–250 ms | translation.max_tokens, n_ctx, prompt length |
| TTS TTFA | 400–1500 ms | backend, tts.streaming_mode, batch / chunk size |
| Audio playback start | ~50 ms | audio_output.buffer_size, crossfade_ms |

Latencies are sampled live by LatencyTracker and shown in Section A2.
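A minimal sketch of what such a tracker might look like. The class name comes from the document; the API (start/stop per stage, rolling p50) is an assumption for illustration:

```python
import time
from collections import defaultdict, deque

class LatencyTracker:
    """Sketch: rolling per-stage latency samples (hypothetical API)."""

    def __init__(self, window=50):
        self.samples = defaultdict(lambda: deque(maxlen=window))
        self.starts = {}

    def start(self, stage):
        self.starts[stage] = time.monotonic()

    def stop(self, stage):
        # Close the timing span and record the sample in milliseconds.
        ms = (time.monotonic() - self.starts.pop(stage)) * 1000.0
        self.samples[stage].append(ms)
        return ms

    def p50(self, stage):
        xs = sorted(self.samples[stage])
        return xs[len(xs) // 2] if xs else None
```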

Knobs

STT

  • compute_type — int8_float16 is the default. Switching to int8 drops VRAM and is a touch faster on Ampere; float16 is slower but marginally more accurate.
  • beam_size — 1 (greedy) is the default. Higher beams add 100–300 ms with minor quality gains.
  • window_seconds — shorter windows reduce partial latency but hurt long-segment coherence. 6.5 s is a good trade.
  • max_segment_age — was raised from 2.0 s to 4.0 s (commit 53e3cbe) to avoid mid-sentence force-flushes; lower it again if you prefer twitchier confirmations.
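Assuming a YAML-style config file (the dotted key names above suggest one; the exact file layout here is hypothetical), the STT knobs map onto a block like:

```yaml
stt:
  compute_type: int8_float16  # int8 = less VRAM, a touch faster on Ampere; float16 = slower, marginally more accurate
  beam_size: 1                # greedy; higher beams add 100-300 ms for minor quality gains
  window_seconds: 6.5         # shorter = faster partials, weaker long-segment coherence
  max_segment_age: 4.0        # raised from 2.0 in 53e3cbe; lower for twitchier confirmations
```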

Translation

  • n_ctx — kept at 512 to keep TTFB low. Increase only if your prompts plus context plus glossary need more headroom.
  • max_tokens — capped at 64 to bound the worst case. Translations longer than this are rare and indicate a too-large source batch.
  • temperature / top_k — held at 0 / 1 for determinism.
  • context_window_size — 2 pairs is enough for pronoun stability without inflating prompt length.
  • Backend — switch to api and point at LM Studio if you have a bigger card; the controller switches transparently.
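A quick back-of-envelope check that n_ctx = 512 leaves headroom for the pieces above. All token counts below are illustrative assumptions, not measured values from the project:

```python
# Rough token budget for the translation prompt at n_ctx = 512.
N_CTX = 512
MAX_TOKENS = 64       # translation.max_tokens (reserved for the output)
SYSTEM_PROMPT = 120   # illustrative token count
GLOSSARY = 80         # illustrative token count
PAIR_TOKENS = 40      # one source/target context pair, illustrative
CONTEXT_PAIRS = 2     # context_window_size
SOURCE = 60           # incoming segment, illustrative

used = SYSTEM_PROMPT + GLOSSARY + CONTEXT_PAIRS * PAIR_TOKENS + SOURCE + MAX_TOKENS
headroom = N_CTX - used
print(used, headroom)  # prints: 404 108
```

If a longer system prompt or glossary pushes headroom negative, that is the signal to raise n_ctx rather than truncate context.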

TTS

  • streaming_mode — 3 is the SoVITS chunk-streaming mode. Lower values produce one big chunk (higher TTFA, easier on the player).
  • speed_factor — slight speed-up (1.2–1.3) cuts perceived latency because audio finishes sooner relative to the next translation.
  • tts_queue_max_pending — drop-oldest cap. Kept low (3) so a long speech burst does not back-pressure the queue.
  • Two-tier playback — already on by default since Phase 5 (e9d4556), with crossfade defaulting to 50 ms.
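The drop-oldest semantics of tts_queue_max_pending can be sketched as follows; the class name and API are hypothetical, not the project's actual queue:

```python
from collections import deque

class DropOldestQueue:
    """Sketch of a bounded queue that evicts the oldest pending item."""

    def __init__(self, max_pending=3):   # tts_queue_max_pending
        self.items = deque()
        self.max_pending = max_pending
        self.dropped = 0

    def put(self, item):
        if len(self.items) >= self.max_pending:
            self.items.popleft()  # drop the oldest pending utterance
            self.dropped += 1
        self.items.append(item)

    def get(self):
        return self.items.popleft()
```

Dropping the oldest (rather than rejecting the newest) keeps playback anchored to the most recent speech, at the cost of skipping stale utterances during a burst.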

AEC

AEC adds negligible latency in the audio path (a single per-chunk NLMS pass), but it removes the need to mute the mic during TTS, which would otherwise add hundreds of milliseconds of silence to every reply. Leave it on unless you are using a hardware-cancelling headset.
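For illustration, a per-chunk NLMS pass has this general shape; the function name, parameters, and tap count are assumptions, not the project's actual implementation:

```python
import numpy as np

def nlms_chunk(mic, ref, w, mu=0.5, eps=1e-8):
    """One NLMS pass over a chunk: subtract the estimated echo of `ref`
    (the TTS playback) from `mic`. `w` is the adaptive FIR filter state
    carried between chunks."""
    taps = len(w)
    out = np.empty_like(mic)
    # Zero-pad so every sample sees `taps` reference samples of history.
    pad = np.concatenate([np.zeros(taps - 1), ref])
    for n in range(len(mic)):
        x = pad[n:n + taps][::-1]            # most recent ref samples first
        y = w @ x                            # estimated echo
        e = mic[n] - y                       # echo-cancelled output sample
        w += (mu / (x @ x + eps)) * e * x    # normalized LMS update
        out[n] = e
    return out, w
```

Because the update is O(taps) per sample, the added latency is just the per-chunk compute time, with no algorithmic look-ahead.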

Diagnostic Workflow

  1. Watch Section A2 metrics during a known-length utterance.
  2. If STT partial is high: lower window_seconds or check that compute_type is int8_float16.
  3. If translation is high: shorten the system prompt, drop context_window_size, or move to the API backend.
  4. If TTS TTFA is high: confirm warm-up succeeded (logs); check streaming_mode; for Qwen3-TTS confirm the synth-latency preset (commit 123b65e).
  5. If the audio device under-runs: raise audio_output.buffer_size.
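The workflow above can be collapsed into a trivial checker over the Section A2 metrics. The metric keys and threshold values here are assumptions chosen to match the budget table, not part of the project:

```python
# Sketch: map observed stage latencies (ms) to the tuning steps above.
def diagnose(metrics):
    tips = []
    if metrics.get("stt_partial", 0) > 800:
        tips.append("lower stt.window_seconds / check compute_type is int8_float16")
    if metrics.get("translation", 0) > 250:
        tips.append("shorten the system prompt, drop context_window_size, or use the API backend")
    if metrics.get("tts_ttfa", 0) > 1500:
        tips.append("check warm-up logs and tts.streaming_mode")
    if metrics.get("underruns", 0) > 0:
        tips.append("raise audio_output.buffer_size")
    return tips
```

For example, diagnose({"stt_partial": 900, "translation": 120, "tts_ttfa": 2000}) flags only the STT and TTS stages.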