09 — TTS Modules

Source: src/vocal10n/tts/.

Vocal10n supports two interchangeable TTS backends, both of which run as separate subprocesses and are accessed over local HTTP:

  • GPT-SoVITS (default, mature) — voice cloning via vendor/GPT-SoVITS/api_v2.py.
  • Qwen3-TTS — newer model with three voice modes: clone, built-in speaker, or natural-language voice design.

Only one backend is active at a time. The UI’s tts_container_tab swaps between them; status flags live in SystemState.tts_status and SystemState.tts_qwen3_status.

Common Pipeline

```mermaid
flowchart LR
    Ev{{"TRANSLATION_CONFIRMED"}} --> Q["TTSQueue"]
    Q --> HTTP["HTTP request"]
    HTTP --> Bytes["audio bytes"]
    Bytes --> Player["AudioPlayer<br/>(sounddevice)"]
    Player --> Spk(["speakers"])
    Player --> Timeline["PlaybackTimeline<br/>(reference for AEC)"]
```

TTSQueue (queue.py) enforces:

  • tts_queue_max_size — hard cap; further enqueues block.
  • tts_queue_max_pending — drop-oldest threshold that keeps latency within the configured budget: stale text from the start of a long burst is dropped in favour of recent text.
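
The two thresholds can be sketched as a small bounded deque. This is a simplified stand-in for queue.py (class and method names here are assumptions, and the real queue blocks at the hard cap rather than refusing):

```python
import collections
import threading

class DropOldestQueue:
    """Sketch of the TTSQueue policy: beyond max_pending, evict the oldest
    pending text; max_size is the hard cap (the real queue blocks there)."""

    def __init__(self, max_size=16, max_pending=4):
        self._items = collections.deque()
        self._lock = threading.Lock()
        self.max_size = max_size        # hard cap; real enqueues block here
        self.max_pending = max_pending  # drop-oldest threshold
        self.dropped = 0

    def put(self, text):
        with self._lock:
            if len(self._items) >= self.max_size:
                return False  # simplified: refuse instead of blocking
            while len(self._items) >= self.max_pending:
                self._items.popleft()   # stale text from the burst start is dropped
                self.dropped += 1
            self._items.append(text)
            return True

    def get(self):
        with self._lock:
            return self._items.popleft() if self._items else None
```

During a long burst this keeps only the most recent utterances pending, which is what bounds the latency.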

Two-tier playback was introduced in commit e9d4556 (Phase 5 TTS overhaul). The first tier streams audio chunks as they arrive from the backend, giving low TTFA (time-to-first-audio); the second tier crossfades between consecutive chunks via audio_output.crossfade_ms to mask boundaries.
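
A minimal illustration of the second tier, assuming a plain linear crossfade (the real code mixes a crossfade_ms-long region of PCM sample arrays; here both inputs are equal-length lists of float samples):

```python
def crossfade(prev_tail, next_head):
    """Linearly blend the tail of one audio chunk into the head of the next,
    masking the chunk boundary. Simplified stand-in for the crossfade applied
    via audio_output.crossfade_ms."""
    n = len(prev_tail)
    assert len(next_head) == n, "both regions must cover the same duration"
    out = []
    for i in range(n):
        w = (i + 1) / n  # fade-in weight for the new chunk, 0 -> 1 across the region
        out.append(prev_tail[i] * (1.0 - w) + next_head[i] * w)
    return out
```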

GPT-SoVITS Backend

  • server_manager.py launches vendor/GPT-SoVITS/api_v2.py from venv_tts (in practice the launch script starts it; the manager handles re-launches).
  • client.py is the HTTP client. Reference audio path is resolved to an absolute path before the call (commit 4a19037 fixed a long cold-start bug caused by relative paths).
  • Warm-up: when the server reports ready, the controller fires a tiny synthesis request to force the model graphs onto the GPU. Done in a background thread so the UI stays responsive (commit 142f915).
  • tts_qwen3 keys are ignored on this path; only the tts.* block is consulted.
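
The warm-up step can be sketched as follows. This is a hedged illustration, not the controller's actual code: the endpoint path, query parameters, and port are assumptions here, and the real warm-up is wired into the controller's ready callback:

```python
import threading
import urllib.request

def warm_up(base_url="http://127.0.0.1:9880"):
    """Fire one tiny synthesis request off the UI thread once the server
    reports ready, so the model graphs land on the GPU before the first
    real utterance. Best-effort: failures are swallowed (real code logs)."""
    def _worker():
        try:
            resp = urllib.request.urlopen(base_url + "/tts?text=.&text_lang=en", timeout=30)
            resp.read()  # discard the tiny audio payload
        except OSError:
            pass  # warm-up is best-effort
    t = threading.Thread(target=_worker, daemon=True, name="tts-warmup")
    t.start()
    return t
```

Because the request runs on a daemon thread, a slow or failing warm-up never blocks the UI.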

Qwen3-TTS Backend

  • qwen3_server.py runs the Qwen3-TTS model in its own subprocess. Stdout is reserved for a line-delimited binary protocol; stderr is drained separately so log output does not corrupt the channel (commit 2311042).
  • qwen3_client.py implements the protocol.
  • qwen3_controller.py mirrors the GPT-SoVITS controller surface so the rest of the app (queue, playback, latency tracker) is unchanged.

voice_mode selects between clone, speaker, and design. The synthesis tab updates tts_qwen3.voice_mode and the relevant subkey when the user changes the selection. Simple mode forces clone with the configured reference audio (commit fdf978c).
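
A hypothetical sketch of that update, treating the config as a plain dict. The subkey names below are invented for illustration and are not the actual tts_qwen3 schema:

```python
def apply_voice_selection(config, mode, value):
    """Update tts_qwen3.voice_mode and the subkey relevant to that mode.
    Subkey names here (ref_audio / speaker / voice_description) are assumptions."""
    assert mode in ("clone", "speaker", "design")
    q = config.setdefault("tts_qwen3", {})
    q["voice_mode"] = mode
    subkey = {"clone": "ref_audio",
              "speaker": "speaker",
              "design": "voice_description"}[mode]
    q[subkey] = value
    return config
```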

Audio Output

audio_output.py owns a single sounddevice.OutputStream per active device. It supports:

  • Selecting any output device exposed by PortAudio (the UI deduplicates devices that appear with multiple host APIs β€” commit 74be837).
  • Routing source-TTS and target-TTS streams to different devices when both tts_source_enabled and tts_target_enabled are set, so a user can pipe target language to a virtual cable for streaming while keeping source audio on the local speakers.
  • Reporting the actual TTFA (time-to-first-audio) back to the latency tracker.
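
TTFA reporting can be reduced to two timestamps on a monotonic clock: when synthesis was requested, and when the first chunk reaches the output. A minimal sketch, with names that are assumptions rather than the actual audio_output.py API:

```python
import time

class TTFAMeter:
    """Measure time-to-first-audio: mark the synthesis request, then record
    the gap when the first chunk hits the output path. Later chunks of the
    same utterance do not change the measurement."""

    def __init__(self):
        self._requested_at = None
        self.ttfa = None

    def on_request(self):
        self._requested_at = time.monotonic()
        self.ttfa = None

    def on_first_audio(self):
        if self._requested_at is not None and self.ttfa is None:
            self.ttfa = time.monotonic() - self._requested_at
        return self.ttfa
```

Using time.monotonic() rather than wall-clock time keeps the measurement immune to clock adjustments.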

Source vs. Target TTS

There are two TTS feeds, controlled by independent flags:

  • Target TTS (default on). Speaks the translation. With STT enabled this is the simultaneous-interpretation use case; with STT disabled it becomes a TTS sandbox driven by manual text.
  • Source TTS (default off). Speaks the corrected source text. With STT enabled this is effectively a voice changer; with STT disabled it is a plain TTS read-aloud.

When both are on, the user is expected to route them to different output devices. The UI surfaces this as a hint in the TTS tab.
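
The routing rule and the clash hint can be sketched as a pure function over the config. The key names below mirror the flags mentioned above, but the device-key names and the "default" fallback are assumptions:

```python
def resolve_routing(cfg):
    """Decide which feeds are active and where each goes; flag a clash when
    both feeds would land on the same device (the TTS tab shows a hint then)."""
    routes = {}
    if cfg.get("tts_target_enabled", True):        # target TTS defaults on
        routes["target"] = cfg.get("target_device", "default")
    if cfg.get("tts_source_enabled", False):       # source TTS defaults off
        routes["source"] = cfg.get("source_device", "default")
    clash = len(routes) == 2 and routes["source"] == routes["target"]
    return routes, clash
```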