08 — LLM Translation Module

Source: src/vocal10n/llm/.

Responsibilities

The LLM stage performs two distinct jobs on the transcript stream:

  1. Source correction. Takes raw STT text, applies glossary / RAG substitutions, and (optionally) restores punctuation.
  2. Translation. Takes the corrected source text and produces the target-language version, optionally conditioned on the previous context_window_size translation pairs.

A single Qwen3-4B instance handles both calls. The model is small enough that, at n_ctx=512, each call completes well within the translation.target_latency_ms budget.
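
The two-call shape can be sketched as follows. This is an illustrative outline, not the actual vocal10n API: the class, prompt strings, and the `llm` callable are all assumptions.

```python
from collections import deque

class TwoStagePipeline:
    """Sketch: one model instance serves both the correction and translation calls."""

    def __init__(self, llm, context_window_size=3):
        self.llm = llm  # callable: prompt -> completion (stands in for the engine)
        # Rolling (source, target) pairs, bounded by context_window_size.
        self.context = deque(maxlen=context_window_size)

    def process(self, raw_text):
        # Call 1: source correction (glossary substitution, punctuation restore).
        corrected = self.llm(f"Correct: {raw_text}")
        # Call 2: translation, conditioned on recent (source, target) pairs.
        ctx = "\n".join(f"{s} => {t}" for s, t in self.context)
        target = self.llm(f"Context:\n{ctx}\nTranslate: {corrected}")
        self.context.append((corrected, target))
        return corrected, target
```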

Backends

engine.py wraps llama-cpp-python for the local GGUF backend. api_backend.py wraps any OpenAI-compatible HTTP server (LM Studio, Ollama with the OpenAI shim, vLLM, OpenAI itself) using translation.api_* config keys. The active backend is selected by translation.backend.

The LLM tab can flip backends at runtime; the controller hot-swaps the implementation while preserving prompt and KB state.
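
A minimal sketch of config-keyed backend selection, assuming a flat config dict and placeholder classes (the real `engine.py` and `api_backend.py` wrappers carry model/HTTP state not shown here):

```python
class LocalEngine:
    """Stands in for the llama-cpp-python GGUF wrapper in engine.py."""
    def generate(self, prompt):
        return f"[local] {prompt}"

class ApiBackend:
    """Stands in for the OpenAI-compatible HTTP wrapper in api_backend.py."""
    def generate(self, prompt):
        return f"[api] {prompt}"

BACKENDS = {"local": LocalEngine, "api": ApiBackend}

def make_backend(config):
    # translation.backend selects the implementation; both expose the same
    # generate() surface so the controller can hot-swap without touching
    # prompt or KB state.
    name = config.get("translation.backend", "local")
    return BACKENDS[name]()
```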

Prompt Format

Prompts are ChatML-style messages ending in an assistant cue, which empirically gave the most stable JSON-free outputs from Qwen3-4B-Instruct in this size class. The prompt builder lives in translator.py and includes:

  • A system message with the source/target language pair and behavioural rules (“Translate, do not explain. Preserve proper nouns. Use the glossary when applicable.”).
  • An optional glossary block — either inline (when small) or retrieved by rag.py (when the glossary exceeds rag_threshold terms).
  • A short context window of recent (source, target) pairs to keep pronoun and tense choices consistent across utterances.
  • The current source text.

Generation defaults (temperature=0.0, top_k=1, top_p=1.0, max_tokens=64) intentionally produce short, deterministic outputs.
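
The message assembly described above can be sketched like this. The system-message wording paraphrases the doc's rules; the real builder lives in translator.py and this function signature is an assumption.

```python
# Deterministic, short-output defaults from the doc.
GEN_DEFAULTS = dict(temperature=0.0, top_k=1, top_p=1.0, max_tokens=64)

def build_messages(src_lang, tgt_lang, glossary_block, context_pairs, source_text):
    system = (
        f"Translate {src_lang} to {tgt_lang}. Translate, do not explain. "
        "Preserve proper nouns. Use the glossary when applicable."
    )
    if glossary_block:
        # Inline glossary (or the RAG-retrieved subset when the glossary is large).
        system += "\n\nGlossary:\n" + glossary_block
    messages = [{"role": "system", "content": system}]
    # Recent (source, target) pairs keep pronoun/tense choices consistent.
    for src, tgt in context_pairs:
        messages.append({"role": "user", "content": src})
        messages.append({"role": "assistant", "content": tgt})
    messages.append({"role": "user", "content": source_text})
    return messages
```

The message list would then be passed to the active backend along with `GEN_DEFAULTS`.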

Corrector

corrector.py runs before translation when enabled:

  • Replaces glossary keys in the source text with their canonical form.
  • Optionally invokes the LLM with a “punctuate and clean up” prompt for the rare confirmed segment that arrived without punctuation.

It is a separate path so that translation calls remain short and predictable.
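
A sketch of the glossary-substitution pass. The whole-word regex strategy and longest-key-first ordering are assumptions; corrector.py may implement this differently.

```python
import re

def apply_glossary(text, glossary):
    """Replace glossary keys in the source text with their canonical form."""
    # Longer keys first so multi-word terms win over their substrings.
    for key in sorted(glossary, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(key)}\b", glossary[key],
                      text, flags=re.IGNORECASE)
    return text
```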

RAG

rag.py is the vector-retrieval implementation that activates when the mounted glossary has more than translation.rag_threshold terms. It:

  • Embeds glossary terms with a small local embedding model.
  • On each translation call, embeds the source text and retrieves the top-K relevant entries to inject into the prompt instead of the full glossary.

This keeps context windows tractable when a domain glossary is large (thousands of terms).
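
A toy version of the retrieval step, using cosine similarity over precomputed term embeddings. The similarity metric and data layout are assumptions for illustration; rag.py uses a real local embedding model rather than hand-written vectors.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_terms(query_vec, glossary_vecs, k=5):
    """glossary_vecs: {term: embedding}. Returns the k closest terms,
    which are injected into the prompt instead of the full glossary."""
    scored = sorted(glossary_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [term for term, _ in scored[:k]]
```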

Controller

controller.py ties the engine together with the dispatcher:

  • Subscribes to STT_CONFIRMED and (optionally) the partial stream.
  • Debounces partial calls by pipeline.translation_debounce_ms.
  • Batches confirmed text using pipeline.confirmed_batch_delay_ms so that punctuation-broken Whisper segments are joined into a single translation request.
  • Maintains the rolling translation context (source/target pairs).
  • Publishes TRANSLATION_PARTIAL and TRANSLATION_CONFIRMED.
  • Updates SystemState.current_translation and SystemState.accumulated_translation for the UI.
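
The confirmed-text batching step can be sketched as below: segments arriving within confirmed_batch_delay_ms of each other accumulate and are flushed as one joined request. The class and the injectable clock are illustrative; controller.py's actual threading/event-loop mechanics are omitted.

```python
import time

class ConfirmedBatcher:
    """Joins punctuation-broken segments into a single translation request."""

    def __init__(self, delay_ms, clock=time.monotonic):
        self.delay = delay_ms / 1000.0
        self.clock = clock  # injectable for testing
        self.pending = []
        self.last_at = None

    def add(self, text):
        # Each new confirmed segment restarts the batching window.
        self.pending.append(text)
        self.last_at = self.clock()

    def flush_if_due(self):
        # Returns the joined batch once the delay has elapsed, else None.
        if self.pending and self.clock() - self.last_at >= self.delay:
            batch, self.pending = " ".join(self.pending), []
            return batch
        return None
```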

Manual-input mode (no STT) is supported by routing UI text input directly to the controller through the same translation entry point. When STT is disabled, the controller does not gate output behind the “confirmed” event.