09 – TTS Modules
Source: src/vocal10n/tts/.
Vocal10n supports two interchangeable TTS backends, both of which run as separate subprocesses and are accessed over local HTTP:
- GPT-SoVITS (default, mature) – voice cloning via `vendor/GPT-SoVITS/api_v2.py`.
- Qwen3-TTS – newer model with three voice modes: clone, built-in speaker, or natural-language voice design.
Only one backend is active at a time. The UI's tts_container_tab swaps
between them; status flags live in SystemState.tts_status and
SystemState.tts_qwen3_status.
Common Pipeline
```mermaid
flowchart LR
    Ev{{"TRANSLATION_CONFIRMED"}} --> Q["TTSQueue"]
    Q --> HTTP["HTTP request"]
    HTTP --> Bytes["audio bytes"]
    Bytes --> Player["AudioPlayer<br/>(sounddevice)"]
    Player --> Spk(["speakers"])
    Player --> Timeline["PlaybackTimeline<br/>(reference for AEC)"]
```
TTSQueue (queue.py) enforces:
- `tts_queue_max_size` – hard cap; further enqueues block.
- `tts_queue_max_pending` – drop-oldest threshold that keeps latency within the configured budget. Stale text from the start of a long burst is dropped in favour of recent text.
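The two thresholds can be sketched as a small bounded queue. This is a hypothetical implementation, not the real `queue.py`; the class name `DropOldestQueue` and all internals are illustrative, only the two config semantics come from the docs above:

```python
import threading
from collections import deque
from typing import Optional

class DropOldestQueue:
    """Sketch of the TTSQueue policy: block producers at a hard cap,
    and drop the oldest pending items past a latency threshold.
    (Illustrative only; the real queue.py may differ.)"""

    def __init__(self, max_size: int, max_pending: int):
        assert max_pending <= max_size
        self.max_size = max_size        # tts_queue_max_size: enqueues block beyond this
        self.max_pending = max_pending  # tts_queue_max_pending: drop-oldest threshold
        self._items = deque()
        self._not_full = threading.Condition(threading.Lock())

    def put(self, text: str) -> list:
        """Enqueue text; returns any stale items that were dropped."""
        dropped = []
        with self._not_full:
            # Hard cap: block until a consumer makes room.
            while len(self._items) >= self.max_size:
                self._not_full.wait()
            self._items.append(text)
            # Keep latency in budget: discard the oldest pending text
            # in favour of the most recent.
            while len(self._items) > self.max_pending:
                dropped.append(self._items.popleft())
        return dropped

    def get(self) -> Optional[str]:
        with self._not_full:
            item = self._items.popleft() if self._items else None
            self._not_full.notify()
            return item
```

With `max_pending=3`, enqueueing five utterances in a burst drops the first two, so playback always starts from recent text.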
Two-tier playback was introduced in commit e9d4556 (Phase 5 TTS
overhaul). The first tier streams audio chunks as they arrive from the
backend, giving low TTFA (time-to-first-audio); the second tier
crossfades between consecutive chunks via
audio_output.crossfade_ms to mask boundaries.
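The second tier's boundary masking amounts to blending the tail of one chunk into the head of the next. A minimal sketch, assuming a linear ramp (the real player's curve is not documented here) and float32 PCM:

```python
import numpy as np

def crossfade(prev: np.ndarray, nxt: np.ndarray, fade_samples: int) -> np.ndarray:
    """Linear crossfade between consecutive audio chunks to mask the
    boundary. Sketch only; the real audio player may use a different
    curve. `fade_samples` would derive from audio_output.crossfade_ms
    and the stream sample rate."""
    n = min(fade_samples, len(prev), len(nxt))
    if n == 0:
        return np.concatenate([prev, nxt])
    ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
    # Fade prev out while fading nxt in over the overlap region.
    blended = prev[-n:] * (1.0 - ramp) + nxt[:n] * ramp
    return np.concatenate([prev[:-n], blended, nxt[n:]])

# Converting the config value to samples, e.g. 20 ms at 32 kHz:
# fade_samples = int(sample_rate * crossfade_ms / 1000)
```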
GPT-SoVITS Backend
- `server_manager.py` launches `vendor/GPT-SoVITS/api_v2.py` from `venv_tts` (typically the launch script does this; the manager handles re-launches).
- `client.py` is the HTTP client. The reference audio path is resolved to an absolute path before the call (commit 4a19037 fixed a long cold-start bug caused by relative paths).
- Warm-up: when the server reports ready, the controller fires a tiny synthesis request to force the model graphs onto the GPU. This runs in a background thread so the UI stays responsive (commit 142f915).
- `tts_qwen3` keys are ignored on this path; only the `tts.*` block is consulted.
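The two client-side details above can be sketched as follows. The parameter names in the payload are illustrative, not the real api_v2.py schema; `build_tts_payload` and `warm_up` are hypothetical helpers:

```python
import threading
from pathlib import Path

def build_tts_payload(text: str, ref_audio: str, lang: str = "en") -> dict:
    """Sketch of request construction. Key point (cf. commit 4a19037):
    resolve the reference audio to an absolute path before the call,
    since the backend subprocess may run from a different working
    directory. Field names here are assumptions."""
    ref_path = Path(ref_audio).resolve()  # absolute path avoids the cold-start bug
    if not ref_path.is_file():
        raise FileNotFoundError(ref_path)
    return {"text": text, "text_lang": lang, "ref_audio_path": str(ref_path)}

def warm_up(synthesize) -> threading.Thread:
    """Fire a tiny synthesis in a daemon thread so model graphs load
    onto the GPU without blocking the UI (cf. commit 142f915)."""
    t = threading.Thread(target=lambda: synthesize("."), daemon=True)
    t.start()
    return t
```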
Qwen3-TTS Backend
- `qwen3_server.py` runs the Qwen3-TTS model in its own subprocess. Stdout is reserved for a line-delimited binary protocol; stderr is drained separately so log output does not corrupt the channel (commit 2311042).
- `qwen3_client.py` implements the protocol.
- `qwen3_controller.py` mirrors the GPT-SoVITS controller surface so the rest of the app (queue, playback, latency tracker) is unchanged.
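One plausible shape for such a line-delimited binary protocol (the actual wire format of `qwen3_server.py` is not documented here, so treat this as an assumption): a JSON header line carrying the payload length, followed by exactly that many raw bytes.

```python
import json

def write_frame(stream, header: dict, payload: bytes) -> None:
    """Frame = JSON header line (with byte count) + raw payload.
    Keeping stdout binary-only is why stderr must be drained in a
    separate thread: a stray log line here would desync the reader."""
    header = dict(header, size=len(payload))
    stream.write(json.dumps(header).encode() + b"\n")
    stream.write(payload)
    stream.flush()

def read_frame(stream):
    """Returns (header, payload), or (None, b'') at end of stream."""
    line = stream.readline()
    if not line:
        return None, b""
    header = json.loads(line)
    return header, stream.read(header["size"])
```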
voice_mode selects between clone, speaker, and design. The
synthesis tab updates tts_qwen3.voice_mode and the relevant subkey
when the user changes selection. Simple mode forces clone with the
configured reference audio (commit fdf978c).
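How the three modes might map onto a synthesis request can be sketched like this; only `voice_mode` and the `tts_qwen3.*` config block come from the docs, while the subkey and field names (`ref_audio`, `speaker`, `description`) are hypothetical:

```python
def qwen3_request(cfg: dict, text: str) -> dict:
    """Illustrative dispatch on tts_qwen3.voice_mode. The real
    qwen3_controller.py schema may use different key names."""
    mode = cfg.get("voice_mode", "clone")  # Simple mode forces "clone"
    req = {"text": text, "mode": mode}
    if mode == "clone":
        req["ref_audio"] = cfg["ref_audio"]      # reference clip to clone
    elif mode == "speaker":
        req["speaker"] = cfg["speaker"]          # built-in speaker id
    elif mode == "design":
        req["description"] = cfg["description"]  # natural-language voice design
    else:
        raise ValueError(f"unknown voice_mode: {mode}")
    return req
```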
Audio Output
audio_output.py owns a single sounddevice.OutputStream per active
device. It supports:
- Selecting any output device exposed by PortAudio (the UI deduplicates devices that appear under multiple host APIs – commit 74be837).
- Routing source-TTS and target-TTS streams to different devices when both `tts_source_enabled` and `tts_target_enabled` are set, so a user can pipe the target language to a virtual cable for streaming while keeping source audio on the local speakers.
- Reporting the actual TTFA (time-to-first-audio) back to the latency tracker.
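TTFA reporting reduces to stamping the request and recording the delta when the first audio block reaches the output. A minimal sketch; the class name and wiring are assumptions, not the real audio_output.py:

```python
import time

class TTFATracker:
    """Sketch of time-to-first-audio measurement: stamp the synthesis
    request, then latch the elapsed time when the first audio block is
    written to the output stream. Only the first block counts."""

    def __init__(self):
        self._t0 = None
        self.ttfa_ms = None  # reported back to the latency tracker

    def on_request(self):
        self._t0 = time.monotonic()
        self.ttfa_ms = None

    def on_first_block(self):
        # Latch once; later blocks of the same utterance are ignored.
        if self._t0 is not None and self.ttfa_ms is None:
            self.ttfa_ms = (time.monotonic() - self._t0) * 1000.0
```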
Source vs. Target TTS
There are two TTS feeds, controlled by independent flags:
- Target TTS (default on). Speaks the translation. With STT enabled this is the simultaneous-interpretation use case; with STT disabled it becomes a TTS sandbox driven by manual text.
- Source TTS (default off). Speaks the corrected source text. With STT enabled this is effectively a voice changer; with STT disabled it is a plain TTS read-aloud.
When both are on the user is expected to route them to different output devices. The UI surfaces this as a hint in the TTS tab.