14 — Knowledge Base and RAG
Vocal10n uses domain knowledge in three places:
- STT term files: improve recognition of domain jargon by feeding terms into Whisper’s initial_prompt.
- Translation glossary / corrector: replace or annotate terms in the source text before translation.
- Hallucination filter list: drop known-bad Whisper outputs.
All three are managed from the KB tab (vocal10n.ui.tabs.kb_tab).
STT Term Files
Source: vocal10n.stt.filters + UI widget term_file_list.py.
- Files live under stt_terms/ (e.g. context_gaming.txt).
- One term per line. Comments allowed with #.
- Multiple files can be selected simultaneously; the union is fed to initial_prompt capped at stt.initial_prompt_capacity (default 200).
- The KB tab provides drag-drop add, an inline editor for the active file, and a status display showing current capacity utilisation (commit 6a45a0c).
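The term-file rules above can be sketched roughly as follows. This is a minimal illustration, not the code in vocal10n.stt.filters: the function names load_terms and build_initial_prompt are hypothetical, the cap is assumed to count terms (the unit of stt.initial_prompt_capacity is not stated here), comments are assumed to be whole lines starting with #, and joining terms with commas is likewise an assumption.

```python
from pathlib import Path


def load_terms(path: Path) -> list[str]:
    """Read one term per line, skipping blank lines and '#' comment lines."""
    terms = []
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            terms.append(line)
    return terms


def build_initial_prompt(files: list[Path], capacity: int = 200) -> tuple[str, float]:
    """Union the selected files in order, cap the result, report utilisation."""
    seen: dict[str, None] = {}  # dict keeps insertion order and drops duplicates
    for f in files:
        for term in load_terms(f):
            seen.setdefault(term, None)
    kept = list(seen)[:capacity]  # assumption: the capacity counts terms
    # Assumption: terms are joined with commas to form the prompt string.
    return ", ".join(kept), len(kept) / capacity
```

The second return value is the fraction of the capacity in use, which is the kind of figure a capacity-utilisation display could show.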
A separate “phonetic index” file (also under stt_terms/) is used by
the post-recognition phonetic corrector for fuzzy domain-term matching.
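As a rough illustration of fuzzy domain-term matching, the sketch below snaps each recognised word to the closest known term. It is a stand-in only: it compares spellings with Python's difflib rather than pronunciations, so it is not a true phonetic match, and the function name correct_terms, the 0.8 cutoff, and the word-by-word splitting are assumptions rather than details of the actual corrector or its index file.

```python
from difflib import get_close_matches


def correct_terms(text: str, domain_terms: list[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a domain term with that term."""
    lookup = {t.lower(): t for t in domain_terms}  # match case-insensitively
    out = []
    for word in text.split():
        match = get_close_matches(word.lower(), list(lookup), n=1, cutoff=cutoff)
        # Keep the original word unless a term is similar enough.
        out.append(lookup[match[0]] if match else word)
    return " ".join(out)
```

With the terms Fireball and Cooldown loaded, the misrecognised word "fireboll" is close enough to be rewritten as "Fireball", while unrelated words are left alone.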
Translation Glossary
Source: vocal10n.llm.corrector + vocal10n.llm.rag.
- Default file: knowledge_base/glossary_general.txt.
- Format: one entry per line, key = canonical translation or key -> canonical (the corrector accepts both).
- Behaviour:
  - Small glossaries (≤ translation.rag_threshold entries, default 100): the entire glossary is embedded inline in the prompt.
  - Large glossaries: rag.py embeds entries with a small local embedding model and retrieves only the top-K relevant ones per translation call (commit d34045f).
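The entry format and the small/large split can be sketched as below. This is an illustrative approximation, not the code in vocal10n.llm.corrector or rag.py: the names parse_glossary, embed, and select_entries are hypothetical, top_k=5 is an arbitrary choice, lines that match neither separator are assumed to be skipped, and character-bigram counts stand in for the real embedding model so that the example runs without extra dependencies.

```python
import re
from collections import Counter
from math import sqrt

# Accepts both "key = canonical" and "key -> canonical".
ENTRY = re.compile(r"^(?P<key>.+?)\s*(?:->|=)\s*(?P<value>.+)$")


def parse_glossary(text: str) -> dict[str, str]:
    """Parse one entry per line; lines without a separator are skipped."""
    entries = {}
    for raw in text.splitlines():
        m = ENTRY.match(raw.strip())
        if m:
            entries[m["key"]] = m["value"]
    return entries


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character-bigram counts."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_entries(glossary: dict[str, str], source: str,
                   rag_threshold: int = 100, top_k: int = 5) -> dict[str, str]:
    """Return the whole glossary if it is small, else the top-K closest entries."""
    if len(glossary) <= rag_threshold:
        return dict(glossary)
    query = embed(source)
    ranked = sorted(glossary, key=lambda k: cosine(embed(k), query), reverse=True)
    return {k: glossary[k] for k in ranked[:top_k]}
```

The point of the split is prompt size: a small glossary costs little to inline in full, while a large one would crowd the prompt, so only the entries most similar to the current source text are passed along.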
Multiple knowledge bases can be mounted from the Translation tab; they are concatenated with deduplication.
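Concatenation with deduplication might look like the following sketch. The function name merge_glossaries is hypothetical, and the tie-breaking rule (the first mounted knowledge base wins when a key appears more than once) is an assumption; the actual precedence is not specified here.

```python
def merge_glossaries(*glossaries: dict[str, str]) -> dict[str, str]:
    """Concatenate glossaries in mount order, keeping one entry per key."""
    merged: dict[str, str] = {}
    for glossary in glossaries:
        for key, value in glossary.items():
            # Assumption: an earlier knowledge base takes precedence.
            merged.setdefault(key, value)
    return merged
```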
Filter List
config/filters.txt is a plain list of regex / literal phrases that the
hallucination filter drops. Editable in-app via
vocal10n.ui.widgets.filter_list_editor (commit 890b782). Common
entries cover Whisper’s well-known boilerplate hallucinations on silence
(thank-you-for-watching variants, watermark transcripts, etc.).
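A filter of this kind can be sketched as follows. The names compile_filters and is_hallucination are hypothetical, and several behaviours are assumptions rather than facts about the real filter: each line is tried as a regex first and treated as a literal phrase only if it fails to compile, matching is case-insensitive, and a match anywhere in the text is enough to drop it.

```python
import re


def compile_filters(lines: list[str]) -> list[re.Pattern]:
    """Turn filter-list lines into compiled patterns, skipping blank lines."""
    patterns = []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        try:
            patterns.append(re.compile(entry, re.IGNORECASE))
        except re.error:
            # Not a valid regex: fall back to matching the line literally.
            patterns.append(re.compile(re.escape(entry), re.IGNORECASE))
    return patterns


def is_hallucination(text: str, patterns: list[re.Pattern]) -> bool:
    """True if any filter entry matches somewhere in the transcribed text."""
    return any(p.search(text) for p in patterns)
```

A caller would read config/filters.txt once, pass its lines to compile_filters, and discard any transcribed segment for which is_hallucination returns True.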
Why These Are Separate
The STT prompt list and the translation glossary serve different goals:
- The STT list biases acoustic decoding so jargon survives the transcription step at all. It must stay short to avoid prompt pollution and is capped accordingly.
- The translation glossary enforces canonical wording in the target language. It can be much larger and is what RAG was added for.
Keeping them in distinct files lets each be tuned without affecting the other. The KB tab consolidates them visually so users have one place to go.