Speak in one language.
Hear another.
Vocal10n is an open-source, fully local speech translation pipeline.
STT → LLM translation → cloned-voice TTS, all in under 3 seconds on a single GPU.
Pipeline
How It Works
Five stages, all running locally on your GPU: from microphone to cloned voice in under three seconds.
Speak
Your microphone captures audio at 16 kHz. Acoustic echo cancellation removes TTS playback before it reaches the recogniser.
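A minimal sketch of the capture stage, assuming the sounddevice package; the echo-cancellation step is marked with a placeholder comment rather than implemented.

```python
import queue

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000      # 16 kHz mono, what the recogniser expects
BLOCK = 1_600             # 100 ms per callback

blocks: queue.Queue = queue.Queue()

def on_audio(indata, frames, time_info, status):
    mono = indata[:, 0].copy()
    # (real pipeline: the TTS playback signal is cancelled here, i.e. AEC)
    blocks.put(mono)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    blocksize=BLOCK, callback=on_audio):
    sd.sleep(3_000)       # capture for 3 seconds

chunk = np.concatenate(list(blocks.queue))
print(f"captured {chunk.size / SAMPLE_RATE:.1f} s of audio")
```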
Transcribe
FasterWhisper large-v3-turbo converts speech to text in real time using a 6.5-second sliding decode window.
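A sketch of this stage using the faster-whisper package; the zero-filled buffer stands in for the live 6.5 s window, and the decode options shown are illustrative rather than Vocal10n's exact settings.

```python
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# Stands in for the most recent 6.5 s of 16 kHz mono audio from the capture stage.
window = np.zeros(int(16_000 * 6.5), dtype=np.float32)

segments, info = model.transcribe(window, beam_size=5, vad_filter=True)
print(f"detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")
```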
Translate
Qwen3-4B (local GGUF via llama.cpp) corrects punctuation, injects glossary terms, and translates to the target language.
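A sketch of the translation call via the llama-cpp-python bindings; the system prompt wording is illustrative, not Vocal10n's actual prompt.

```python
from llama_cpp import Llama

llm = Llama(model_path="models/llm/Qwen3-4B-Instruct-2507.Q4_K_M.gguf",
            n_gpu_layers=-1, n_ctx=4096)

resp = llm.create_chat_completion(messages=[
    {"role": "system",
     "content": "Restore punctuation, keep glossary terms verbatim, "
                "then translate the user's text into Japanese."},
    {"role": "user", "content": "ok so today were testing the new overlay"},
])
print(resp["choices"][0]["message"]["content"])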
Synthesise
GPT-SoVITS generates cloned-voice speech from a short reference clip. Qwen3-TTS is available as an alternative backend.
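A sketch of a synthesis request, assuming GPT-SoVITS's api_v2 server is running locally on its default port 9880; parameter names vary between GPT-SoVITS releases, so treat them as assumptions.

```python
import requests

resp = requests.post("http://127.0.0.1:9880/tts", json={
    "text": "こんにちは、配信へようこそ。",     # translated text to speak
    "text_lang": "ja",
    "ref_audio_path": "ref/my_voice.wav",      # short reference clip
    "prompt_text": "Transcript of the reference clip.",
    "prompt_lang": "en",
})
resp.raise_for_status()
with open("tts_out.wav", "wb") as f:
    f.write(resp.content)                      # playable cloned-voice audio
```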
Output
Translated audio plays back. Live subtitles appear in OBS. Paired SRT and WAV files are saved to disk.
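A sketch of the subtitle half of that pairing, formatting one SRT cue; the file name and cue text are illustrative.

```python
def srt_time(seconds: float) -> str:
    # SRT timestamps are HH:MM:SS,mmm
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

cue = f"1\n{srt_time(0.0)} --> {srt_time(2.4)}\nHello, welcome to the stream.\n\n"
with open("session_0001.srt", "a", encoding="utf-8") as f:
    f.write(cue)
```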
Capabilities
Everything in one pipeline
All components run locally; no data leaves your machine.
Real-time STT
FasterWhisper large-v3-turbo with a 6.5 s sliding decode window, hallucination filtering, phonetic dedup, and automatic language detection.
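A sketch of the dedup idea: overlapping decode windows re-emit near-identical text, so a similarity check drops repeats. Here difflib stands in for the phonetic comparison, and the 0.9 threshold is illustrative.

```python
from difflib import SequenceMatcher

def is_duplicate(new: str, confirmed_tail: str, threshold: float = 0.9) -> bool:
    # Near-identical hypotheses from successive windows are treated as repeats.
    ratio = SequenceMatcher(None, new.lower(), confirmed_tail.lower()).ratio()
    return ratio >= threshold

print(is_duplicate("hello every one", "hello everyone"))  # True: drop the repeat
```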
Local LLM Translation
Qwen3-4B-Instruct GGUF via llama.cpp; no internet required. OpenAI-compatible HTTP API supported as an opt-in alternative backend.
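A sketch of the opt-in HTTP backend, assuming an OpenAI-compatible server (for example llama.cpp's llama-server) at localhost:8080; the URL and model name are assumptions.

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint works; no real API key is needed locally.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

resp = client.chat.completions.create(
    model="qwen3-4b-instruct",
    messages=[{"role": "user", "content": "Translate into German: see you tomorrow"}],
)
print(resp.choices[0].message.content)
```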
Voice-Cloned TTS
GPT-SoVITS clones your voice from a short reference clip and speaks translated text back. Qwen3-TTS is available as an alternative.
OBS Integration
A built-in HTTP subtitle server streams live text to an OBS Browser Source. Partial and confirmed translations update in near real time.
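A minimal sketch of the idea: a local endpoint an OBS Browser Source page can poll for the latest line. Vocal10n's actual routes and port are not shown here; everything below is illustrative.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

CURRENT_LINE = "Partial translation appears here..."  # updated by the pipeline

class SubtitleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the current subtitle as plain UTF-8 text on every request.
        body = CURRENT_LINE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 8765), SubtitleHandler).serve_forever()
```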
Glossary & RAG
Inject domain-specific terms directly into the STT prompt and LLM context. Vector retrieval kicks in automatically for large glossaries.
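A sketch of the two injection points; the substring matching below stands in for the vector retrieval used on large glossaries, and the 50-term threshold is illustrative.

```python
GLOSSARY = {"overlay": "オーバーレイ", "bitrate": "ビットレート"}

# Biases Whisper toward glossary spellings (passed as its initial prompt).
stt_prompt = "Glossary: " + ", ".join(GLOSSARY)

def glossary_context(terms: dict[str, str], source: str, max_inline: int = 50) -> str:
    if len(terms) <= max_inline:
        pairs = list(terms.items())            # small glossary: inline everything
    else:
        # Large glossary: keep only terms relevant to the source text
        # (stands in for vector retrieval).
        pairs = [(s, t) for s, t in terms.items() if s.lower() in source.lower()]
    return "\n".join(f"{s} => {t}" for s, t in pairs)

print(glossary_context(GLOSSARY, "can you raise the bitrate a bit"))
```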
Sub-3s Latency
All three models coexist on a 12 GB GPU (~9.5 GB combined), with a carefully tuned VRAM budget and configurable debounce and batch pacing.
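A sketch of the debounce idea: partial transcripts keep changing while you speak, so text is only forwarded downstream once it has held still for a configurable interval. The class and the 0.4 s default are illustrative.

```python
import time

class Debouncer:
    def __init__(self, hold_s: float = 0.4):
        self.hold_s = hold_s
        self._text = ""
        self._changed_at = time.monotonic()

    def update(self, text: str) -> str | None:
        now = time.monotonic()
        if text != self._text:
            self._text, self._changed_at = text, now
            return None                      # still changing: keep holding
        if now - self._changed_at >= self.hold_s:
            return self._text                # stable: release to the LLM stage
        return None
```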
Hardware
System Requirements
Vocal10n runs entirely offline on a single Windows machine.
No data leaves your system.
| Component | Requirement |
| --- | --- |
| Operating System | Windows 10 / 11 |
| GPU | NVIDIA RTX 3060 12 GB or better |
| CUDA | CUDA Toolkit 12.x |
| Python | Python 3.11 |
| VRAM Required | ~9.5 GB combined (all 3 models) |
| Disk Space | ~15 GB (models + dependencies) |
Quickstart
Getting Started
Up and running in three steps.
Clone & Set Up Environments
Clone the repository and run the setup script. It creates two virtual environments automatically.
git clone https://github.com/itsLittleKevin/Vocal10n.git
cd Vocal10n
.\setup_env.ps1
Add Your Models
Download the LLM and TTS models and place them in the matching subdirectories under models/; the STT model is downloaded automatically on first run.
models/
├── llm/   ← Qwen3-4B-Instruct-2507.Q4_K_M.gguf
├── stt/   ← FasterWhisper large-v3-turbo (auto-downloaded)
└── tts/   ← GPT-SoVITS weights
Launch
Start the full pipeline with a single command. The UI will open and guide you through selecting your audio device.
.\start.ps1
Roadmap
Planned
Training Pipeline
Curate session output (SRT + WAV) into a labelled training dataset with an approve/reject flow.
Planned
Remote Backend Split
Point STT, LLM, and TTS at separate remote hosts. HTTP boundaries already exist for TTS and LLM.
Planned
Headless / API Mode
A non-Qt HTTP/WebSocket entry point so browsers, mobile clients, and OBS can share one backend.
Planned
Multi-GPU Sharding
Per-module device selection to pin Whisper, Qwen3, and SoVITS to different GPUs on 24 GB+ rigs.
Planned
Containerisation
Docker images for the main app and each TTS backend, with a Compose file to wire all services.