01 - Overview
What is Vocal10n
Vocal10n is a real-time speech translation system that runs locally on a single workstation. A user speaks in one language; the system produces:
- Live subtitles in the source language (streaming, partial results).
- A corrected transcript in the source language (with punctuation).
- A translated transcript in the target language.
- Synthesised speech in the target language using voice-cloned audio.
- Optional OBS browser-source overlay for streaming/recording.
- Optional `.srt`, `.txt`, and `.wav` files written to `output/`.
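As an illustration of the optional file outputs, a minimal SRT writer could look like the sketch below. The helper names and cue layout are hypothetical, not the project's actual API; only the `.srt` output format itself comes from the list above.

```python
from pathlib import Path

def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(cues: list[tuple[float, float, str]], path: Path) -> None:
    """Write (start_sec, end_sec, text) cues as a numbered SRT file."""
    blocks = [
        f"{i}\n{format_timestamp(a)} --> {format_timestamp(b)}\n{text}\n"
        for i, (a, b, text) in enumerate(cues, start=1)
    ]
    path.write_text("\n".join(blocks), encoding="utf-8")
```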
It is built around three local models:
| Stage | Model | Role |
|---|---|---|
| STT | FasterWhisper large-v3-turbo | Streaming speech recognition |
| LLM | Qwen3-4B-Instruct (Q4_K_M GGUF) via llama-cpp-python | Punctuation correction + translation |
| TTS | GPT-SoVITS (default) or Qwen3-TTS | Voice-cloned speech synthesis |
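The three-stage flow in the table can be sketched as a simple chain. The `Pipeline` class and stage signatures below are illustrative stand-ins, not the project's real wrappers, which stream audio and tokens incrementally:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pipeline:
    # Hypothetical stage signatures; real wrappers are streaming.
    stt: Callable[[bytes], str]   # audio -> source-language text
    llm: Callable[[str], str]     # raw text -> corrected + translated text
    tts: Callable[[str], bytes]   # target-language text -> synthesised audio

    def run(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)       # FasterWhisper stage
        translated = self.llm(transcript)  # Qwen3 stage
        return self.tts(translated)        # GPT-SoVITS / Qwen3-TTS stage

# Stub backends just to show the data flow:
pipe = Pipeline(
    stt=lambda audio: "hello world",
    llm=lambda text: text.upper(),
    tts=lambda text: text.encode(),
)
```

In the real system each stage can be swapped or disabled independently, which is why the stages are held as plain callables here rather than hard-wired calls.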
Design Goals
- Low latency end-to-end. Target speech-to-speech latency is under 3 s, with sub-1.5 s OBS subtitle latency.
- Single GPU. All three models must coexist on a 12 GB card (RTX 3060 reference). VRAM budget is roughly Whisper 2.5 GB + Qwen3 4 GB + GPT-SoVITS 3 GB ≈ 9.5 GB, leaving headroom.
- Local-first. No required cloud calls. An OpenAI-compatible HTTP backend is supported as an opt-in alternative for the LLM stage.
- Independent module toggles. STT, LLM and TTS can each be enabled or disabled independently. Disabling STT turns the app into a manual translator; disabling LLM lets STT-only output pass through; enabling only TTS turns the app into a voice changer / TTS sandbox.
- Two UX modes. A Pro mode exposes every parameter, and a Simple mode collapses it to a one-click Start All / Stop All experience.
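The independent module toggles described above can be sketched as follows. The `Toggles` dataclass and mode names are hypothetical; the toggle-to-behavior mapping itself is taken from the list:

```python
from dataclasses import dataclass

@dataclass
class Toggles:
    # Hypothetical config fields; each stage toggles independently.
    stt: bool = True
    llm: bool = True
    tts: bool = True

def describe_mode(t: Toggles) -> str:
    """Map toggle combinations to the behaviors described above."""
    if t.tts and not t.stt and not t.llm:
        return "voice changer / TTS sandbox"   # only TTS enabled
    if not t.stt:
        return "manual translator"             # type text instead of speaking
    if not t.llm:
        return "STT-only passthrough"          # raw transcript, no correction
    return "full pipeline"
```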
Target Hardware and OS
- Windows 10 / 11.
- NVIDIA GPU with CUDA 12.x (RTX 3060 12 GB or better recommended).
- Python 3.11.
- Two virtual environments are created automatically:
  `venv_main` for the application, `venv_tts` for the GPT-SoVITS server subprocess.
What Vocal10n is Not
- It is not a meeting-transcription product; there is no multi-channel diarisation pipeline beyond the optional speaker tagger.
- It is not a cloud translator wrapper; while an OpenAI-compatible backend is supported, the default flow is fully local.
- It is not a TTS training tool; an output mode collects WAV + SRT pairs for downstream training, but training itself is out of scope.