08 — LLM Translation Module
Source: src/vocal10n/llm/.
Responsibilities
The LLM stage performs two distinct jobs on the transcript stream:
- Source correction. Takes raw STT text, applies glossary / RAG substitutions, and (optionally) restores punctuation.
- Translation. Takes the corrected source text and produces the
target-language version, optionally conditioned on the previous
context_window_size translation pairs.
A single Qwen3-4B instance handles both calls. The model is small enough
that the per-call overhead at n_ctx=512 is well under the
translation.target_latency_ms budget.
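A minimal sketch of this two-stage flow; the function names below are hypothetical stand-ins, not the real module API:

```python
def correct(raw: str, glossary: dict[str, str]) -> str:
    """Stage 1 (sketch): apply glossary substitutions to raw STT text."""
    for variant, canonical in glossary.items():
        raw = raw.replace(variant, canonical)
    return raw

def process_segment(raw: str, glossary: dict[str, str], translate) -> str:
    """Run correction, then translation, as two separate LLM-stage calls."""
    source = correct(raw, glossary)
    return translate(source)

# Trivial stand-in translator, just to show the call order:
result = process_segment("kubernets is down", {"kubernets": "Kubernetes"},
                         translate=lambda s: f"[fr] {s}")
# → "[fr] Kubernetes is down"
```

In the real pipeline, both calls hit the same Qwen3-4B instance; only the prompts differ.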
Backends
engine.py wraps llama-cpp-python for the local GGUF backend.
api_backend.py wraps any OpenAI-compatible HTTP server (LM Studio,
Ollama with the OpenAI shim, vLLM, OpenAI itself) using translation.api_*
config keys. The active backend is selected by translation.backend.
The LLM tab can flip backends at runtime; the controller hot-swaps the implementation while preserving prompt and KB state.
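Backend selection can be sketched as dispatch over a shared interface; the class and key names here are illustrative, not the actual engine.py / api_backend.py API:

```python
from typing import Protocol

class LLMBackend(Protocol):
    """Minimal shared interface both backends would satisfy (assumed)."""
    def complete(self, prompt: str) -> str: ...

class LocalBackend:
    """Stand-in for the llama-cpp-python GGUF wrapper in engine.py."""
    def complete(self, prompt: str) -> str:
        return f"local:{prompt}"

class ApiBackend:
    """Stand-in for the OpenAI-compatible HTTP client in api_backend.py."""
    def complete(self, prompt: str) -> str:
        return f"api:{prompt}"

def select_backend(name: str) -> LLMBackend:
    """Mimics dispatch on the translation.backend config key (values assumed)."""
    backends = {"local": LocalBackend, "api": ApiBackend}
    return backends[name]()
```

Because both backends expose the same interface, a runtime flip from the LLM tab only needs to swap the instance; prompt and KB state live outside the backend object.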
Prompt Format
Prompts are ChatML-style messages ending in an assistant cue, which
empirically gave the most stable JSON-free outputs from Qwen3-4B-Instruct
in this size class. The prompt builder lives in translator.py and
includes:
- A system message with the source/target language pair and behavioural rules (“Translate, do not explain. Preserve proper nouns. Use the glossary when applicable.”).
- An optional glossary block — either inline (when small) or retrieved by
rag.py (when the glossary exceeds rag_threshold terms).
- A short context window of recent (source, target) pairs to keep pronoun
and tense choices consistent across utterances.
- The current source text.
Generation defaults (temperature=0.0, top_k=1, top_p=1.0,
max_tokens=64) intentionally produce short, deterministic outputs.
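The prompt shape described above can be sketched as a message-list builder; the signature and exact rule wording are assumptions based on this section, not the actual translator.py code:

```python
def build_messages(source_lang: str, target_lang: str, glossary_block: str,
                   context_pairs: list[tuple[str, str]], source_text: str):
    """Assemble ChatML-style messages for one translation call (sketch)."""
    system = (f"Translate from {source_lang} to {target_lang}. "
              "Translate, do not explain. Preserve proper nouns. "
              "Use the glossary when applicable.")
    if glossary_block:
        system += "\nGlossary:\n" + glossary_block
    messages = [{"role": "system", "content": system}]
    # Recent (source, target) pairs become alternating user/assistant turns,
    # so the model sees its own prior translations as context.
    for src, tgt in context_pairs:
        messages.append({"role": "user", "content": src})
        messages.append({"role": "assistant", "content": tgt})
    # The current source text is the final user turn; the assistant cue
    # (the model's turn to speak) is appended by the chat template.
    messages.append({"role": "user", "content": source_text})
    return messages
```

With greedy decoding (temperature=0.0, top_k=1), the same message list always yields the same output, which keeps partial and confirmed translations consistent.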
Corrector
corrector.py runs before translation when enabled:
- Replaces glossary keys in the source text with their canonical form.
- Optionally invokes the LLM with a “punctuate and clean up” prompt for the rare confirmed segment that arrived without punctuation.
It is a separate path so that translation calls remain short and predictable.
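The glossary-substitution step can be sketched as follows, assuming whole-word, case-insensitive matching with longer keys taking precedence (the real corrector.py may match differently):

```python
import re

def apply_glossary(text: str, glossary: dict[str, str]) -> str:
    """Replace glossary keys with their canonical forms (sketch).

    Longest keys are applied first so multi-word terms win over their
    substrings; \\b anchors avoid rewriting inside larger words.
    """
    for key in sorted(glossary, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(key)}\b", glossary[key], text,
                      flags=re.IGNORECASE)
    return text
```

Keeping this as plain string work (no LLM call) is what lets the separate punctuation prompt stay rare and the translation calls stay short.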
RAG
rag.py is the vector-retrieval implementation that activates when the
mounted glossary has more than translation.rag_threshold terms. It:
- Embeds glossary terms with a small local embedding model.
- On each translation call, embeds the source text and retrieves the top-K relevant entries to inject into the prompt instead of the full glossary.
This keeps context windows tractable when a domain glossary is large (thousands of terms).
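The retrieve-top-K idea can be sketched with a toy bag-of-words similarity; rag.py uses a real embedding model, so everything below is purely illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; stands in for the local embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(source: str, glossary_terms: list[str], k: int = 2) -> list[str]:
    """Return the k glossary entries most similar to the source text."""
    q = embed(source)
    ranked = sorted(glossary_terms, key=lambda t: cosine(q, embed(t)),
                    reverse=True)
    return ranked[:k]
```

Only the retrieved entries are injected into the prompt, so the prompt size stays bounded by K rather than by the glossary size.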
Controller
controller.py ties the engine together with the dispatcher:
- Subscribes to STT_CONFIRMED and (optionally) the partial stream.
- Debounces partial calls by pipeline.translation_debounce_ms.
- Batches confirmed text using pipeline.confirmed_batch_delay_ms so that
punctuation-broken Whisper segments are joined into a single translation
request.
- Maintains the rolling translation context (source/target pairs).
- Publishes TRANSLATION_PARTIAL and TRANSLATION_CONFIRMED.
- Updates SystemState.current_translation and
SystemState.accumulated_translation for the UI.
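The confirmed-text batching can be sketched as follows; the real controller is event-loop driven, so this timestamp-based version is only illustrative (class and parameter names assumed):

```python
class ConfirmedBatcher:
    """Join confirmed segments that arrive within batch_delay_s of each
    other into one translation request (sketch of the behaviour driven by
    pipeline.confirmed_batch_delay_ms)."""

    def __init__(self, batch_delay_s: float, flush):
        self.batch_delay_s = batch_delay_s
        self.flush = flush            # called with the joined text
        self.pending: list[str] = []
        self.last_at = 0.0

    def on_confirmed(self, text: str, now: float) -> None:
        # A gap longer than the batch delay closes the current batch.
        if self.pending and now - self.last_at > self.batch_delay_s:
            self.flush(" ".join(self.pending))
            self.pending = []
        self.pending.append(text)
        self.last_at = now

    def close(self) -> None:
        if self.pending:
            self.flush(" ".join(self.pending))
            self.pending = []
```

Joining punctuation-broken Whisper segments this way means the translator sees whole sentences, which matters for word order in the target language.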
Manual-input mode (no STT) routes UI text input directly to the controller through the same translation entry point. When STT is disabled, the controller does not gate output behind the “confirmed” event.