05 - Configuration
All runtime configuration lives in config/default.yaml. It is loaded
once by vocal10n.config.get_config(), which exposes a small Config
object supporting dotted-key access (cfg.get("stt.model_size")) and
section views (cfg.section("stt")).
This chapter is a reference for every section. Defaults shown reflect the shipped values.
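To make the two access styles concrete, here is a minimal sketch of how dotted-key lookup and section views could work over a nested mapping. It is not the actual vocal10n.config implementation: the class body, the default argument on get, and the copy-on-read behaviour of section are all assumptions made for illustration.

```python
from typing import Any


class Config:
    """Hypothetical stand-in for the Config object described above."""

    def __init__(self, data: dict[str, Any]) -> None:
        self._data = data

    def get(self, dotted_key: str, default: Any = None) -> Any:
        # Walk the nested mapping one key segment at a time.
        node: Any = self._data
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node

    def section(self, name: str) -> dict[str, Any]:
        # Return a shallow copy so callers cannot mutate the loaded config
        # (an assumption; the real object may return a live view).
        value = self._data.get(name, {})
        return dict(value) if isinstance(value, dict) else {}


# Values copied from the tables in this chapter; the nesting is assumed.
cfg = Config({
    "pipeline": {"target_latency_ms": 2500},
    "stt": {"model_size": "large-v3-turbo", "beam_size": 1},
})
```

With this sketch, cfg.get("stt.model_size") walks two levels of the mapping, while cfg.section("stt") hands back the whole stt block at once.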
pipeline
Top-level switches and pacing.
| Key | Default | Meaning |
|---|---|---|
| name | "Vocal10n" | Display name. |
| target_latency_ms | 2500 | Soft end-to-end latency target. |
| enable_stt / enable_translation / enable_tts | false | Module toggles at startup. UI may flip these. |
| enable_pending_translation | true | Translate uncommitted text for display only. |
| enable_confirmed_translation | true | Translate committed text for TTS / files. |
| tts_source | "confirmed" | Which text feeds TTS: confirmed, pending, or both. |
| translation_debounce_ms | 150 | Debounce window for partial-translation calls. |
| confirmed_batch_delay_ms | 400 | Delay before flushing confirmed text into a batch. |
| tts_queue_max_size | 10 | Hard cap on queued TTS jobs. |
| tts_queue_max_pending | 3 | Drop-oldest threshold to keep latency bounded. |
| max_buffer_age_s | 2.0 | Max age before an unconfirmed buffer is force-flushed. |
| min_clause_chars | 8 | Minimum clause length before clause-end triggers translation. |
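The drop-oldest behaviour behind tts_queue_max_pending can be sketched as follows. This is an illustrative model only: the class name and its methods are hypothetical, and how the real pipeline combines this threshold with the tts_queue_max_size hard cap is not modelled here.

```python
from collections import deque
from typing import Optional


class DropOldestQueue:
    """Hypothetical queue that discards the oldest job past a threshold."""

    def __init__(self, max_pending: int = 3) -> None:
        self.max_pending = max_pending
        self.dropped = 0
        self._jobs: deque = deque()

    def put(self, text: str) -> None:
        self._jobs.append(text)
        # Once the backlog exceeds the threshold, drop from the front so
        # that the freshest speech is what eventually gets synthesized.
        while len(self._jobs) > self.max_pending:
            self._jobs.popleft()
            self.dropped += 1

    def get(self) -> Optional[str]:
        return self._jobs.popleft() if self._jobs else None
```

Dropping from the front trades completeness for latency: stale sentences are skipped so that playback does not fall further and further behind the speaker.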
stt - FasterWhisper
| Key | Default | Meaning |
|---|---|---|
| model_size | large-v3-turbo | HF model id or local path. |
| device | cuda | Passed to WhisperModel. |
| compute_type | int8_float16 | Mixed-precision compute mode. |
| window_seconds | 6.5 | Sliding decode window. |
| confirm_threshold | 0.3 | Time tail (s) below which segments stay "pending". |
| min_transcribe_duration | 0.3 | Minimum audio length before a transcribe call. |
| max_segment_age | 4.0 | Force-confirm any segment older than this. |
| sample_rate | 16000 | Mic capture rate. |
| channels / chunk_duration | 1 / 0.2 | Capture chunking. |
| language | null | null = auto-detect; or "zh", "en". |
| use_simplified_chinese | true | Convert traditional output to simplified. |
| initial_prompt_capacity | 200 | Cap on terms injected via initial_prompt. |
| beam_size | 1 | Greedy by default for speed. |
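The timing keys interact, so a small worked example may help. The first function is plain arithmetic on the defaults above. The second is one plausible reading of confirm_threshold and max_segment_age, not the module's actual logic: the function names, arguments, and the "pending"/"confirmed" return values are hypothetical.

```python
def capture_geometry(sample_rate: int = 16000,
                     chunk_duration: float = 0.2,
                     window_seconds: float = 6.5):
    """Return (samples per chunk, samples per window, chunks per window)."""
    samples_per_chunk = round(sample_rate * chunk_duration)
    samples_per_window = round(sample_rate * window_seconds)
    return (samples_per_chunk,
            samples_per_window,
            samples_per_window / samples_per_chunk)


def segment_state(segment_end: float,
                  window_end: float,
                  segment_age: float,
                  confirm_threshold: float = 0.3,
                  max_segment_age: float = 4.0) -> str:
    """Classify a segment under an assumed pending/confirmed rule."""
    # Old segments are confirmed regardless of where they sit in the window.
    if segment_age > max_segment_age:
        return "confirmed"
    # Segments ending inside the trailing tail may still change on the
    # next decode pass, so they stay pending (assumed interpretation).
    tail = window_end - segment_end
    return "pending" if tail < confirm_threshold else "confirmed"
```

With the shipped defaults, each capture chunk holds 3200 samples and a full decode window spans 104000 samples, or 32.5 chunks.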
translation - Qwen3 / OpenAI-compatible
| Key | Default | Meaning |
|---|---|---|
| backend | local | local = llama-cpp GGUF, api = OpenAI-compatible HTTP. |
| model_path | models/llm/Qwen3-4B-Instruct-2507.Q4_K_M.gguf | Used when backend=local. |
| n_gpu_layers / n_ctx / n_batch / n_threads | -1, 512, 8, 4 | llama.cpp tuning. |
| api_url / api_model / api_key / api_timeout | local LM Studio defaults | Used when backend=api. |
| temperature / top_k / top_p / max_tokens | 0.0, 1, 1.0, 64 | Deterministic short outputs. |
| target_latency_ms | 200 | Soft per-call budget. |
| target_language | English | Display language; mapped to code via languages. |
| auto_detect_source | true | Detect source per call rather than relying on STT lang. |
| context_window_size | 2 | Number of prior translation pairs prepended for context. |
| rag_threshold | 100 | Switch to vector retrieval when glossary exceeds this. |
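As an illustration of context_window_size, the sketch below keeps the last N source/translation pairs and prepends them to the next request. The class name and the prompt wording are hypothetical; the real prompt template is not shown in this chapter.

```python
from collections import deque


class TranslationContext:
    """Hypothetical rolling window of prior translation pairs."""

    def __init__(self, context_window_size: int = 2) -> None:
        # A bounded deque discards the oldest pair automatically.
        self._pairs: deque = deque(maxlen=context_window_size)

    def remember(self, source: str, target: str) -> None:
        self._pairs.append((source, target))

    def build_prompt(self, text: str, target_language: str = "English") -> str:
        lines = [f"Translate into {target_language}."]
        for source, target in self._pairs:
            lines.append(f"Source: {source}")
            lines.append(f"Translation: {target}")
        lines.append(f"Source: {text}")
        lines.append("Translation:")
        return "\n".join(lines)
```

A small window keeps the prompt short, which matters here because n_ctx is only 512 tokens and max_tokens is 64.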
tts - GPT-SoVITS
| Key | Default | Meaning |
|---|---|---|
| api_host / api_port / api_timeout | 127.0.0.1, 9880, 60 | HTTP endpoint of the server subprocess. |
| ref_audio_path / ref_audio_text / ref_audio_lang | reference clip + transcript + auto | Voice cloning reference. |
| output_lang | en | Synthesis language code. |
| streaming_mode | 3 | SoVITS streaming chunk size preset. |
| speed_factor | 1.3 | Playback speed scaler. |
| top_k / top_p / temperature | 5, 0.7, 0.5 | Sampling. |
| text_split_method | cut0 | Server-side chunking strategy. |
| batch_size | 1 | Per-request batch. |
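The sketch below shows how these keys might be turned into a synthesis request. It only builds the URL and payload and sends nothing. The /tts path and the request field names (text_lang, prompt_text, prompt_lang, and so on) are assumed to match the GPT-SoVITS api_v2.py server and should be verified against the server version in use; the function name and the reference-clip values are hypothetical placeholders.

```python
from typing import Any


def build_tts_request(tts: dict[str, Any], text: str) -> tuple[str, dict[str, Any]]:
    """Map the tts config section onto an assumed /tts request."""
    url = f"http://{tts['api_host']}:{tts['api_port']}/tts"
    payload = {
        "text": text,
        "text_lang": tts["output_lang"],
        "ref_audio_path": tts["ref_audio_path"],
        "prompt_text": tts["ref_audio_text"],   # transcript of the clip
        "prompt_lang": tts["ref_audio_lang"],
        "top_k": tts["top_k"],
        "top_p": tts["top_p"],
        "temperature": tts["temperature"],
        "text_split_method": tts["text_split_method"],
        "batch_size": tts["batch_size"],
        "speed_factor": tts["speed_factor"],
        "streaming_mode": tts["streaming_mode"],
    }
    return url, payload


# Defaults from the table; the reference clip values are placeholders.
tts_section = {
    "api_host": "127.0.0.1", "api_port": 9880, "api_timeout": 60,
    "ref_audio_path": "ref.wav", "ref_audio_text": "placeholder transcript",
    "ref_audio_lang": "auto", "output_lang": "en", "streaming_mode": 3,
    "speed_factor": 1.3, "top_k": 5, "top_p": 0.7, "temperature": 0.5,
    "text_split_method": "cut0", "batch_size": 1,
}
```

A caller would pass the returned payload to an HTTP client with api_timeout as the request timeout.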
tts_qwen3 - Qwen3-TTS Backend
Voice modes:
| voice_mode | Required keys |
|---|---|
| clone | ref_audio_path, ref_audio_text, ref_audio_lang |
| speaker | speaker (built-in id) and optional speaker_instruct |
| design | design_instruct (free-form description) |
Other parameters mirror typical generation knobs (top_k, top_p,
temperature, max_new_tokens, dtype, use_flash_attn).
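The voice-mode table lends itself to a startup check. The helper below is a hypothetical sketch, not part of the project: it reports which required keys are missing or empty for the selected voice_mode.

```python
REQUIRED_KEYS = {
    "clone": ("ref_audio_path", "ref_audio_text", "ref_audio_lang"),
    "speaker": ("speaker",),          # speaker_instruct is optional
    "design": ("design_instruct",),
}


def missing_voice_keys(section: dict) -> list[str]:
    """Return required keys that are absent or empty for the chosen mode."""
    mode = section.get("voice_mode")
    if mode not in REQUIRED_KEYS:
        raise ValueError(f"unknown voice_mode: {mode!r}")
    return [key for key in REQUIRED_KEYS[mode] if not section.get(key)]
```

An empty result means the section is complete for that mode; anything else can be surfaced to the user before the backend is loaded.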
audio_output
Playback device, sample rate, buffer size, and crossfade length in milliseconds. The crossfade smooths the boundary between consecutive TTS chunks emitted by the streaming player.
aec - Acoustic Echo Cancellation
| Key | Default | Meaning |
|---|---|---|
| enabled | true | Master switch. |
| filter_taps | 2048 | NLMS length. At 16 kHz this is 128 ms of impulse response. |
| step_size | 0.01 | NLMS μ; 0.005–0.05 is the safe range. |
| dt_threshold | 3.0 | Double-talk gate; freeze adaptation when mic ≫ echo estimate. |
| max_delay_ms | 300.0 | Max delay searched by cross-correlation. |
See 07 - STT Module for theory.
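To show what filter_taps and step_size control, here is a textbook NLMS canceller in pure Python. It is a teaching sketch, not the project's implementation: all function names are hypothetical, and the double-talk gate (dt_threshold) and the delay search (max_delay_ms) are deliberately left out.

```python
import random


def taps_to_ms(filter_taps: int = 2048, sample_rate: int = 16000) -> float:
    """Length of the modelled impulse response in milliseconds."""
    return 1000.0 * filter_taps / sample_rate


def nlms_cancel(mic, ref, filter_taps, step_size, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `ref` from `mic`."""
    weights = [0.0] * filter_taps
    history = [0.0] * filter_taps      # most recent reference sample first
    residual = []
    for m, r in zip(mic, ref):
        history.insert(0, r)
        history.pop()
        estimate = sum(w * x for w, x in zip(weights, history))
        err = m - estimate
        # Normalising by reference power makes the step size scale-free.
        power = sum(x * x for x in history) + eps
        gain = step_size * err / power
        weights = [w + gain * x for w, x in zip(weights, history)]
        residual.append(err)
    return residual


def simulate_echo(n=2000, delay=2, gain=0.5, seed=0):
    """Synthetic test signal: the mic hears a delayed, attenuated reference."""
    rng = random.Random(seed)
    ref = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mic = [0.0] * delay + [gain * x for x in ref[: n - delay]]
    return mic, ref
```

The toy signal converges quickly with a short filter and a large step; the shipped defaults use a much longer filter and a smaller step, which adapts more slowly but is more robust to noise and near-end speech.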
languages
Display-name → ISO code map used by language pickers.
obs
Overlay server bind, per-language font family, font size, colour, stroke,
and shadow. The browser source URL is http://127.0.0.1:5124/.
output
Per-format toggles: save_source_txt, save_source_srt,
save_target_txt, save_target_srt, save_wav, plus the destination
directory.
logging
level (INFO, DEBUG, …) and the show_latency / show_vram flags
that toggle the metrics surfaces.