01 β€” Overview

What is Vocal10n

Vocal10n is a real-time speech translation system that runs locally on a single workstation. A user speaks in one language; the system produces:

  1. Live subtitles in the source language (streaming, partial results).
  2. A corrected transcript in the source language (with punctuation).
  3. A translated transcript in the target language.
  4. Synthesised speech in the target language using voice-cloned audio.
  5. Optional OBS browser-source overlay for streaming/recording.
  6. Optional .srt, .txt, and .wav files written to output/.
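Outputs 1–4 fall out of a three-stage pipeline (STT → LLM → TTS). The sketch below shows one way the hand-off between stages could look; the Utterance type and the function names are illustrative placeholders, not Vocal10n's actual API.

    # Illustrative only: a minimal hand-off between the three stages.
    # Utterance, correct_punctuation, translate and synthesize are
    # hypothetical names, not the project's real API.
    from dataclasses import dataclass

    @dataclass
    class Utterance:
        raw_text: str              # streaming STT partials (source language)
        corrected_text: str = ""   # punctuated source transcript
        translated_text: str = ""  # target-language transcript

    def correct_punctuation(text: str) -> str:
        return text  # placeholder for the LLM punctuation pass

    def translate(text: str) -> str:
        return text  # placeholder for the LLM translation pass

    def synthesize(text: str) -> bytes:
        return b""   # placeholder for the TTS stage

    def process(utt: Utterance) -> Utterance:
        utt.corrected_text = correct_punctuation(utt.raw_text)
        utt.translated_text = translate(utt.corrected_text)
        synthesize(utt.translated_text)
        return utt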

It is built around three local models:

  Stage   Model                                                   Role
  STT     FasterWhisper large-v3-turbo                            Streaming speech recognition
  LLM     Qwen3-4B-Instruct (Q4_K_M GGUF) via llama-cpp-python    Punctuation correction + translation
  TTS     GPT-SoVITS (default) or Qwen3-TTS                       Voice-cloned speech synthesis
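As a rough idea of what standing up the first two stages looks like in Python (model path and settings below are assumptions; GPT-SoVITS runs as a separate server subprocess, so it is omitted here):

    # Sketch under assumptions: the model path and parameters are illustrative.
    from faster_whisper import WhisperModel
    from llama_cpp import Llama

    # STT: FasterWhisper with a quantised compute type to stay inside budget.
    stt = WhisperModel("large-v3-turbo", device="cuda",
                       compute_type="int8_float16")

    # LLM: Qwen3-4B-Instruct as a Q4_K_M GGUF, fully offloaded to the GPU.
    llm = Llama(model_path="models/Qwen3-4B-Instruct-Q4_K_M.gguf",
                n_gpu_layers=-1, n_ctx=4096)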

Design Goals

  • Low latency end-to-end. Target speech-to-speech latency is under 3 s, with sub-1.5 s OBS subtitle latency.
  • Single GPU. All three models must coexist on a 12 GB card (RTX 3060 reference). VRAM budget is roughly Whisper 2.5 GB + Qwen3 4 GB + GPT-SoVITS 3 GB ≈ 9.5 GB, leaving about 2.5 GB of headroom.
  • Local-first. No required cloud calls. An OpenAI-compatible HTTP backend is supported as an opt-in alternative for the LLM stage (see the sketch after this list).
  • Independent module toggles. STT, LLM and TTS can each be enabled or disabled independently. Disabling STT turns the app into a manual translator; disabling LLM lets STT-only output pass through; enabling only TTS turns the app into a voice changer / TTS sandbox.
  • Two UX modes. A Pro mode exposes every parameter, and a Simple mode collapses it to a one-click Start All / Stop All experience.
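For the opt-in HTTP backend, any OpenAI-compatible server works. A minimal sketch, assuming a local server on port 8080 and a model alias (both illustrative, not Vocal10n defaults):

    # Hedged sketch: base_url, port and model name are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")
    resp = client.chat.completions.create(
        model="qwen3-4b-instruct",
        messages=[
            {"role": "system",
             "content": "Add punctuation, then translate to English."},
            {"role": "user", "content": "bonjour tout le monde"},
        ],
    )
    print(resp.choices[0].message.content)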

Target Hardware and OS

  • Windows 10 / 11.
  • NVIDIA GPU with CUDA 12.x (RTX 3060 12 GB or better recommended).
  • Python 3.11.
  • Two virtual environments are created automatically: venv_main for the application, venv_tts for the GPT-SoVITS server subprocess.
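The automatic setup amounts to creating and populating two standard venvs. A minimal sketch of the creation step (the surrounding script structure is an assumption):

    # Sketch only: how the two venvs could be created on first run.
    import subprocess
    import sys
    from pathlib import Path

    for name in ("venv_main", "venv_tts"):
        if not Path(name).exists():
            # "python -m venv <name>" creates a standard virtual environment
            subprocess.run([sys.executable, "-m", "venv", name], check=True)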

What Vocal10n is Not

  • It is not a meeting-transcription product; there is no multi-channel diarisation pipeline beyond the optional speaker tagger.
  • It is not a cloud translator wrapper; while an OpenAI-compatible backend is supported, the default flow is fully local.
  • It is not a TTS training tool; an output mode collects WAV + SRT pairs for downstream training, but training itself is out of scope.