05 — Configuration

All runtime configuration lives in config/default.yaml. It is loaded once by vocal10n.config.get_config() which exposes a small Config object supporting dotted-key access (cfg.get("stt.model_size")) and section views (cfg.section("stt")).

This chapter is a reference for every section. Defaults shown reflect the shipped values.

`pipeline`

Top-level switches and pacing.

Key	Default	Meaning
`name`	`"Vocal10n"`	Display name.
`target_latency_ms`	`2500`	Soft end-to-end latency target.
`enable_stt` / `enable_translation` / `enable_tts`	`false`	Module toggles at startup. UI may flip these.
`enable_pending_translation`	`true`	Translate uncommitted text for display only.
`enable_confirmed_translation`	`true`	Translate committed text for TTS / files.
`tts_source`	`"confirmed"`	Which text feeds TTS: `confirmed`, `pending`, or `both`.
`translation_debounce_ms`	`150`	Debounce window for partial-translation calls.
`confirmed_batch_delay_ms`	`400`	Delay before flushing confirmed text into a batch.
`tts_queue_max_size`	`10`	Hard cap on queued TTS jobs.
`tts_queue_max_pending`	`3`	Drop-oldest threshold to keep latency bounded.
`max_buffer_age_s`	`2.0`	Max age before an unconfirmed buffer is force-flushed.
`min_clause_chars`	`8`	Minimum clause length before clause-end triggers translation.

`stt` — FasterWhisper

Key	Default	Meaning
`model_size`	`large-v3-turbo`	HF model id or local path.
`device`	`cuda`	Passed to `WhisperModel`.
`compute_type`	`int8_float16`	Mixed-precision compute mode.
`window_seconds`	`6.5`	Sliding decode window.
`confirm_threshold`	`0.3`	Time tail (s) below which segments stay “pending”.
`min_transcribe_duration`	`0.3`	Minimum audio length before a transcribe call.
`max_segment_age`	`4.0`	Force-confirm any segment older than this.
`sample_rate`	`16000`	Mic capture rate.
`channels` / `chunk_duration`	`1` / `0.2`	Capture chunking.
`language`	`null`	`null` = auto-detect; or `"zh"`, `"en"`.
`use_simplified_chinese`	`true`	Convert traditional output to simplified.
`initial_prompt_capacity`	`200`	Cap on terms injected via `initial_prompt`.
`beam_size`	`1`	Greedy by default for speed.

`translation` — Qwen3 / OpenAI-compatible

Key	Default	Meaning
`backend`	`local`	`local` = llama-cpp GGUF, `api` = OpenAI-compatible HTTP.
`model_path`	`models/llm/Qwen3-4B-Instruct-2507.Q4_K_M.gguf`	Used when `backend=local`.
`n_gpu_layers` / `n_ctx` / `n_batch` / `n_threads`	`-1`, `512`, `8`, `4`	llama.cpp tuning.
`api_url` / `api_model` / `api_key` / `api_timeout`	local LM Studio defaults	Used when `backend=api`.
`temperature` / `top_k` / `top_p` / `max_tokens`	`0.0`, `1`, `1.0`, `64`	Deterministic short outputs.
`target_latency_ms`	`200`	Soft per-call budget.
`target_language`	`English`	Display language; mapped to code via `languages`.
`auto_detect_source`	`true`	Detect source per call rather than relying on STT lang.
`context_window_size`	`2`	Number of prior translation pairs prepended for context.
`rag_threshold`	`100`	Switch to vector retrieval when glossary exceeds this.

`tts` — GPT-SoVITS

Key	Default	Meaning
`api_host` / `api_port` / `api_timeout`	`127.0.0.1`, `9880`, `60`	HTTP endpoint of the server subprocess.
`ref_audio_path` / `ref_audio_text` / `ref_audio_lang`	reference clip + transcript + `auto`	Voice cloning reference.
`output_lang`	`en`	Synthesis language code.
`streaming_mode`	`3`	SoVITS streaming chunk size preset.
`speed_factor`	`1.3`	Playback speed scaler.
`top_k` / `top_p` / `temperature`	`5`, `0.7`, `0.5`	Sampling.
`text_split_method`	`cut0`	Server-side chunking strategy.
`batch_size`	`1`	Per-request batch.

`tts_qwen3` — Qwen3-TTS Backend

Voice modes:

`voice_mode`	Required keys
`clone`	`ref_audio_path`, `ref_audio_text`, `ref_audio_lang`
`speaker`	`speaker` (built-in id) and optional `speaker_instruct`
`design`	`design_instruct` (free-form description)

Other parameters mirror typical generation knobs (top_k, top_p, temperature, max_new_tokens, dtype, use_flash_attn).

`audio_output`

Playback device, sample rate, buffer size, crossfade in milliseconds. The crossfade smooths the boundary between consecutive TTS chunks emitted by the streaming player.

`aec` — Acoustic Echo Cancellation

Key	Default	Meaning
`enabled`	`true`	Master switch.
`filter_taps`	`2048`	NLMS length. At 16 kHz this is 128 ms of impulse response.
`step_size`	`0.01`	NLMS μ; 0.005–0.05 is the safe range.
`dt_threshold`	`3.0`	Double-talk gate; freeze adaptation when mic ≫ echo estimate.
`max_delay_ms`	`300.0`	Max delay searched by cross-correlation.

See 07 — STT Module for theory.

`languages`

Display-name → ISO code map used by language pickers.

`obs`

Overlay server bind, per-language font family, font size, colour, stroke, and shadow. The browser source URL is http://127.0.0.1:5124/.

`output`

Per-format toggles: save_source_txt, save_source_srt, save_target_txt, save_target_srt, save_wav, plus the destination directory.

`logging`

level (INFO, DEBUG, …) and the show_latency / show_vram flags that toggle the metrics surfaces.

05 — Configuration

pipeline

stt — FasterWhisper

translation — Qwen3 / OpenAI-compatible

tts — GPT-SoVITS

tts_qwen3 — Qwen3-TTS Backend

audio_output

aec — Acoustic Echo Cancellation

languages

obs

output

logging