01 β€” Overview

What is Vocal10n

Vocal10n is a real-time speech translation system that runs locally on a single workstation. A user speaks in one language; the system produces:

  1. Live subtitles in the source language (streaming, partial results).
  2. A corrected transcript in the source language (with punctuation).
  3. A translated transcript in the target language.
  4. Synthesised speech in the target language using voice-cloned audio.
  5. Optional OBS browser-source overlay for streaming/recording.
  6. Optional .srt, .txt, and .wav files written to output/.
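Outputs 1–4 fall out of a three-stage pipeline (STT → LLM → TTS). The sketch below shows one way the hand-off between stages could look; the Utterance type and the function names are illustrative placeholders, not Vocal10n's actual API.

    # Illustrative only: a minimal hand-off between the three stages.
    # Utterance, correct_punctuation, translate and synthesize are
    # hypothetical names, not the project's real API.
    from dataclasses import dataclass

    @dataclass
    class Utterance:
        raw_text: str              # streaming STT partials (source language)
        corrected_text: str = ""   # punctuated source transcript
        translated_text: str = ""  # target-language transcript

    def correct_punctuation(text: str) -> str:
        return text  # placeholder for the LLM punctuation pass

    def translate(text: str) -> str:
        return text  # placeholder for the LLM translation pass

    def synthesize(text: str) -> bytes:
        return b""   # placeholder for the TTS stage

    def process(utt: Utterance) -> Utterance:
        utt.corrected_text = correct_punctuation(utt.raw_text)
        utt.translated_text = translate(utt.corrected_text)
        synthesize(utt.translated_text)
        return utt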

It is built around three local models:

  Stage   Model                                                   Role
  STT     FasterWhisper large-v3-turbo                            Streaming speech recognition
  LLM     Qwen3-4B-Instruct (Q4_K_M GGUF) via llama-cpp-python    Punctuation correction + translation
  TTS     GPT-SoVITS (default) or Qwen3-TTS                       Voice-cloned speech synthesis
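As a rough idea of what standing up the first two stages looks like in Python (model path and settings below are assumptions; GPT-SoVITS runs as a separate server subprocess, so it is omitted here):

    # Sketch under assumptions: the model path and parameters are illustrative.
    from faster_whisper import WhisperModel
    from llama_cpp import Llama

    # STT: FasterWhisper with a quantised compute type to stay inside budget.
    stt = WhisperModel("large-v3-turbo", device="cuda",
                       compute_type="int8_float16")

    # LLM: Qwen3-4B-Instruct as a Q4_K_M GGUF, fully offloaded to the GPU.
    llm = Llama(model_path="models/Qwen3-4B-Instruct-Q4_K_M.gguf",
                n_gpu_layers=-1, n_ctx=4096)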

Design Goals

  • Low latency end-to-end. Target speech-to-speech latency is under 3 s, with sub-1.5 s OBS subtitle latency.
  • Single GPU. All three models must coexist on a 12 GB card (RTX 3060 reference). VRAM budget is roughly Whisper 2.5 GB + Qwen3 4 GB + GPT-SoVITS 3 GB ≈ 9.5 GB, leaving about 2.5 GB of headroom.
  • Local-first. No required cloud calls. An OpenAI-compatible HTTP backend is supported as an opt-in alternative for the LLM stage (see the sketch after this list).
  • Independent module toggles. STT, LLM and TTS can each be enabled or disabled independently. Disabling STT turns the app into a manual translator; disabling LLM lets STT-only output pass through; enabling only TTS turns the app into a voice changer / TTS sandbox.
  • Two UX modes. A Pro mode exposes every parameter, and a Simple mode collapses it to a one-click Start All / Stop All experience.
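For the opt-in HTTP backend, any OpenAI-compatible server works. A minimal sketch, assuming a local server on port 8080 and a model alias (both illustrative, not Vocal10n defaults):

    # Hedged sketch: base_url, port and model name are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")
    resp = client.chat.completions.create(
        model="qwen3-4b-instruct",
        messages=[
            {"role": "system",
             "content": "Add punctuation, then translate to English."},
            {"role": "user", "content": "bonjour tout le monde"},
        ],
    )
    print(resp.choices[0].message.content)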

Target Hardware and OS

  • Windows 10 / 11.
  • NVIDIA GPU with CUDA 12.x (RTX 3060 12 GB or better recommended).
  • Python 3.11.
  • Two virtual environments are created automatically: venv_main for the application, venv_tts for the GPT-SoVITS server subprocess.
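The automatic setup amounts to creating and populating two standard venvs. A minimal sketch of the creation step (the surrounding script structure is an assumption):

    # Sketch only: how the two venvs could be created on first run.
    import subprocess
    import sys
    from pathlib import Path

    for name in ("venv_main", "venv_tts"):
        if not Path(name).exists():
            # "python -m venv <name>" creates a standard virtual environment
            subprocess.run([sys.executable, "-m", "venv", name], check=True)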

What Vocal10n is Not

  • It is not a meeting-transcription product; there is no multi-channel diarisation pipeline beyond the optional speaker tagger.
  • It is not a cloud translator wrapper; while an OpenAI-compatible backend is supported, the default flow is fully local.
  • It is not a TTS training tool; an output mode collects WAV + SRT pairs for downstream training, but training itself is out of scope.