14 — Knowledge Base and RAG

Vocal10n uses domain knowledge in three places:

  1. STT term files — improve recognition of domain jargon by feeding terms into Whisper’s initial_prompt.
  2. Translation glossary / corrector — replace or annotate terms in the source text before translation.
  3. Hallucination filter list — drop known-bad Whisper outputs.

All three are managed from the KB tab (vocal10n.ui.tabs.kb_tab).

STT Term Files

Source: vocal10n.stt.filters + UI widget term_file_list.py.

  • Files live under stt_terms/ (e.g. context_gaming.txt).
  • One term per line. Comments allowed with #.
  • Multiple files can be selected at once; the union of their terms is fed to initial_prompt, capped at stt.initial_prompt_capacity (default 200).
  • The KB tab provides drag-drop add, an inline editor for the active file, and a status display showing current capacity utilisation (commit 6a45a0c).
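
The term-file rules above can be sketched as follows. This is a minimal illustration, not the code in vocal10n.stt.filters: the function names are hypothetical, comments are assumed to be whole lines starting with #, the capacity is assumed to count terms (the real unit may be characters or tokens), and joining terms with ", " is likewise an assumption.

```python
from typing import Iterable


def parse_terms(text: str) -> list[str]:
    """Return the terms in one term file, skipping blank lines and comments."""
    terms = []
    for line in text.splitlines():
        term = line.strip()
        # Assumption: a comment is a whole line whose first character is '#'.
        if term and not term.startswith("#"):
            terms.append(term)
    return terms


def build_initial_prompt(file_texts: Iterable[str], capacity: int = 200) -> str:
    """Union the terms of all selected files and cap the result."""
    seen: dict[str, None] = {}  # dict keeps insertion order and removes duplicates
    for text in file_texts:
        for term in parse_terms(text):
            seen.setdefault(term, None)
    # Assumption: capacity is a number of terms, not characters or tokens.
    kept = list(seen)[:capacity]
    return ", ".join(kept)


gaming = "# gaming jargon\nheadshot\nrespawn\n"
racing = "respawn\nspeedrun\n"
print(build_initial_prompt([gaming, racing]))
```

Capping after the union means that files selected earlier take priority when the list overflows; whether the real implementation orders terms this way is not stated here.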

A separate “phonetic index” file (also under stt_terms/) is used by the post-recognition phonetic corrector for fuzzy domain-term matching.
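
To give a feel for what post-recognition fuzzy matching does, here is a rough sketch. It uses plain string similarity from Python's difflib as a stand-in; it is not phonetic matching and not the project's corrector. The function name and the 0.8 cutoff are hypothetical.

```python
import difflib


def correct_terms(transcript: str, domain_terms: list[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a known domain term."""
    # Map lowercase forms back to the canonical spelling.
    canonical = {term.lower(): term for term in domain_terms}
    corrected = []
    for word in transcript.split():
        # Stand-in for phonetic matching: character-level similarity only.
        match = difflib.get_close_matches(word.lower(), list(canonical), n=1, cutoff=cutoff)
        corrected.append(canonical[match[0]] if match else word)
    return " ".join(corrected)


print(correct_terms("the respwan timer", ["respawn"]))
```

A real phonetic corrector would compare pronunciations rather than spellings, which is presumably why the index is kept in its own file.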

Translation Glossary

Source: vocal10n.llm.corrector + vocal10n.llm.rag.

  • Default file: knowledge_base/glossary_general.txt.
  • Format: one entry per line, written as either key = canonical translation or key -> canonical translation (the corrector accepts both).
  • Behaviour:
    • Small glossaries (≤ translation.rag_threshold entries, default 100): the entire glossary is embedded inline in the prompt.
    • Large glossaries (more than translation.rag_threshold entries): rag.py embeds the entries with a small local embedding model and retrieves only the top-K most relevant ones for each translation call (commit d34045f).
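
The format and the inline-versus-retrieval switch described above can be sketched like this. It is an illustration under stated assumptions, not the code in vocal10n.llm.corrector or vocal10n.llm.rag: the function names and the top_k parameter are hypothetical, and a bag-of-words cosine similarity stands in for the local embedding model.

```python
import math
import re
from collections import Counter


def parse_glossary(text: str) -> dict[str, str]:
    """Parse 'key = value' and 'key -> value' lines; skip anything else."""
    entries = {}
    for line in text.splitlines():
        parts = re.split(r"\s*(?:->|=)\s*", line.strip(), maxsplit=1)
        if len(parts) == 2 and parts[0] and parts[1]:
            entries[parts[0]] = parts[1]
    return entries


def _vec(text: str) -> Counter:
    # Stand-in for an embedding: word counts instead of a learned vector.
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_entries(glossary: dict[str, str], source_text: str,
                   rag_threshold: int = 100, top_k: int = 5) -> dict[str, str]:
    """Return the glossary entries to place in the translation prompt."""
    if len(glossary) <= rag_threshold:
        return dict(glossary)  # small glossary: inline everything
    query = _vec(source_text)
    ranked = sorted(glossary, key=lambda key: _cosine(_vec(key), query), reverse=True)
    return {key: glossary[key] for key in ranked[:top_k]}


glossary = parse_glossary("GPU = graphics card\nframe rate -> FPS\n")
print(select_entries(glossary, "the frame rate dropped", rag_threshold=1, top_k=1))
```

The point of the threshold is that a small glossary costs few prompt tokens, so retrieval only pays off once the glossary is large.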

Multiple knowledge bases can be mounted from the Translation tab; they are concatenated with deduplication.
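
One plausible reading of "concatenated with deduplication" is sketched below. The function name is hypothetical, and both the exact-key comparison and the first-mounted-wins rule are assumptions; the actual tie-breaking may differ.

```python
def merge_glossaries(glossaries: list[dict[str, str]]) -> dict[str, str]:
    """Concatenate glossaries in mount order, keeping one entry per key."""
    merged: dict[str, str] = {}
    for glossary in glossaries:
        for key, value in glossary.items():
            # Assumption: the first mounted glossary wins on duplicate keys.
            merged.setdefault(key, value)
    return merged


general = {"GPU": "graphics card", "lag": "latency"}
gaming = {"lag": "input delay", "respawn": "revive"}
print(merge_glossaries([general, gaming]))
```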

Filter List

config/filters.txt is a plain list of regular expressions and literal phrases; the hallucination filter drops any Whisper output that matches one of them. The list is editable in-app via vocal10n.ui.widgets.filter_list_editor (commit 890b782). Common entries cover Whisper’s well-known boilerplate hallucinations on silence (thank-you-for-watching variants, watermark transcripts, etc.).
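
A filter of this kind might look like the sketch below. It is not the project's implementation, and every name in it is hypothetical. It assumes one entry per line, case-insensitive matching against the whole segment, and a fallback to literal matching when an entry is not a valid regular expression.

```python
import re


def compile_filters(lines: list[str]) -> list[re.Pattern]:
    """Compile each entry as a regex, falling back to a literal match."""
    patterns = []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        try:
            patterns.append(re.compile(entry, re.IGNORECASE))
        except re.error:
            # Assumption: an invalid regex is meant as a literal phrase.
            patterns.append(re.compile(re.escape(entry), re.IGNORECASE))
    return patterns


def is_hallucination(text: str, patterns: list[re.Pattern]) -> bool:
    """True if the whole transcribed segment matches a filter entry."""
    stripped = text.strip()
    # Assumption: full-segment match, so real speech that merely
    # contains a filtered phrase is kept.
    return any(p.fullmatch(stripped) for p in patterns)


filters = compile_filters(["thanks? (you )?for watching[.!]*", "subtitles by ("])
print(is_hallucination("Thank you for watching!", filters))
```

Matching the full segment rather than searching inside it is the conservative choice: it drops segments that are nothing but boilerplate and leaves longer utterances alone.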

Why These Are Separate

The STT prompt list and the translation glossary serve different goals:

  • The STT list biases acoustic decoding so jargon survives the transcription step at all. It must stay short to avoid prompt pollution and is capped accordingly.
  • The translation glossary enforces canonical wording in the target language. It can be much larger and is what RAG was added for.

Keeping them in distinct files lets each be tuned without affecting the other. The KB tab consolidates them visually so users have one place to go.