Real-time · Local · Voice-cloned

Speak in one language.
Hear another.

Vocal10n is an open-source, fully local speech translation pipeline.
STT → LLM translation → cloned-voice TTS, all in under 3 seconds on a single GPU.

<3s End-to-end latency
12 GB Single GPU
100% Local, no cloud

Pipeline

How It Works

Five stages, all running locally on your GPU: from microphone to cloned voice in under three seconds.

01

Speak

Your microphone captures audio at 16 kHz. Acoustic echo cancellation removes TTS playback before it reaches the recogniser.
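
As a rough illustration of this capture stage, the sketch below pulls 16 kHz mono blocks off a microphone with the sounddevice package. The queue and block size are assumptions for illustration; echo cancellation happens upstream of this in the real pipeline.

python
# Illustrative 16 kHz mono capture loop (not Vocal10n's actual capture code).
import queue

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000                      # pipeline input rate
blocks: "queue.Queue[np.ndarray]" = queue.Queue()

def on_audio(indata, frames, time_info, status):
    """Queue each captured block of mono float32 samples."""
    if status:
        print(status)
    blocks.put(indata[:, 0].copy())

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    blocksize=SAMPLE_RATE // 2, callback=on_audio):
    sd.sleep(5_000)                       # capture for ~5 s (milliseconds)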

02

Transcribe

FasterWhisper large-v3-turbo converts speech to text in real time using a 6.5-second sliding decode window.
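
A minimal faster-whisper call over one buffered window might look like this; the sliding-window management and filtering that Vocal10n layers on top are omitted.

python
# Sketch: transcribe one window of audio with faster-whisper.
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

def transcribe_window(samples: np.ndarray) -> tuple[str, str]:
    """samples: float32 mono audio at 16 kHz, e.g. the last 6.5 s."""
    segments, info = model.transcribe(samples, beam_size=1, vad_filter=True)
    return " ".join(seg.text.strip() for seg in segments), info.language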

03

Translate

Qwen3-4B (local GGUF via llama.cpp) corrects punctuation, injects glossary terms, and translates to the target language.
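
In llama-cpp-python terms, the translation call could be sketched as below; the prompt wording and sampling settings are assumptions, not the project's shipped defaults.

python
# Sketch: punctuation fix-up + translation via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="models/llm/Qwen3-4B-Instruct-2507.Q4_K_M.gguf",
            n_gpu_layers=-1, n_ctx=4096, verbose=False)

def translate(text: str, target_lang: str, glossary: str = "") -> str:
    out = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": f"Fix punctuation, then translate into {target_lang}. "
                        f"Prefer these glossary terms: {glossary}"},
            {"role": "user", "content": text},
        ],
        temperature=0.2, max_tokens=256,
    )
    return out["choices"][0]["message"]["content"].strip()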

04

Synthesise

GPT-SoVITS generates cloned-voice speech from a short reference clip. Qwen3-TTS is available as an alternative backend.
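
Since the TTS backend sits behind an HTTP boundary, a client call might look like the sketch below. The port, endpoint, and field names are illustrative placeholders, not GPT-SoVITS's documented API.

python
# Hypothetical request to a local GPT-SoVITS server; endpoint and fields are
# placeholders for illustration.
import requests

def synthesise(text: str, lang: str) -> bytes:
    resp = requests.post(
        "http://127.0.0.1:9880/tts",               # assumed local TTS endpoint
        json={
            "text": text,
            "text_lang": lang,
            "ref_audio_path": "ref/voice_5s.wav",  # short reference clip
            "prompt_lang": "en",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content                            # audio bytes from the server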

05

Output

Translated audio plays back. Live subtitles appear in OBS. Paired SRT and WAV files are saved to disk.
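
For the SRT side of that output, appending one cue per confirmed translation is straightforward; the helper below is a generic sketch, not the project's writer.

python
# Sketch: append one SRT cue per confirmed translation.
from pathlib import Path

def srt_timestamp(seconds: float) -> str:
    ms = int(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def append_cue(path: Path, index: int, start: float, end: float, text: str) -> None:
    cue = f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n"
    with path.open("a", encoding="utf-8") as f:
        f.write(cue)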

Capabilities

Everything in one pipeline

All components run locally; no data leaves your machine.

FasterWhisper

Real-time STT

FasterWhisper large-v3-turbo with a 6.5 s sliding decode window, hallucination filtering, phonetic dedup, and automatic language detection.
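
To give a flavour of the dedup step, a toy filter could drop a new hypothesis when it is nearly identical to the previous one. This sketch compares normalised spellings rather than true phonetics, and the threshold is an assumption.

python
# Toy near-duplicate filter; Vocal10n's phonetic dedup is assumed to be richer.
import difflib
import re

def normalise(text: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def is_duplicate(prev: str, new: str, threshold: float = 0.9) -> bool:
    ratio = difflib.SequenceMatcher(None, normalise(prev), normalise(new)).ratio()
    return ratio >= threshold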

Qwen3-4B

Local LLM Translation

Qwen3-4B-Instruct GGUF via llama.cpp, no internet required. An OpenAI-compatible HTTP API is supported as an opt-in alternative backend.
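
The opt-in HTTP backend means any OpenAI-compatible server (such as llama.cpp's llama-server) can stand in for the in-process model; the base URL and model name below are assumptions.

python
# Sketch: translation via an OpenAI-compatible local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3-4b-instruct",
    messages=[
        {"role": "system", "content": "Translate the user's text into Japanese."},
        {"role": "user", "content": "The stream starts in five minutes."},
    ],
)
print(resp.choices[0].message.content)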

GPT-SoVITS

Voice-Cloned TTS

GPT-SoVITS clones your voice from a short reference clip and speaks translated text back. Qwen3-TTS is available as an alternative.

Browser Source

OBS Integration

A built-in HTTP subtitle server streams live text to an OBS Browser Source. Partial and confirmed translations update in near real-time.
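
Conceptually, the subtitle server only has to expose the latest text over HTTP so a Browser Source can render it. The toy server below serves a single plain-text line on an assumed port; the real server also streams partial and confirmed updates.

python
# Toy subtitle endpoint for an OBS Browser Source (port is an assumption).
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

latest = {"text": ""}

class SubtitleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = latest["text"].encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):          # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 5000), SubtitleHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
latest["text"] = "Live subtitles appear here"  # updated by the pipeline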

Knowledge Base

Glossary & RAG

Inject domain-specific terms directly into the STT prompt and LLM context. Vector retrieval kicks in automatically for large glossaries.
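
Injection itself can be as simple as formatting the glossary into the prompts, as in this sketch; the inline-term cap standing in for the retrieval cut-over is an assumption.

python
# Sketch: push glossary terms into the STT prompt and the LLM system prompt.
GLOSSARY = {"Vocal10n": "Vocal10n", "GGUF": "GGUF"}
MAX_INLINE_TERMS = 50   # assumed point where vector retrieval would take over

def stt_initial_prompt(glossary: dict[str, str]) -> str:
    """Bias the recogniser toward domain terms by listing them."""
    return "Vocabulary: " + ", ".join(list(glossary)[:MAX_INLINE_TERMS]) + "."

def llm_glossary_block(glossary: dict[str, str]) -> str:
    """Render source => target pairs for the translation prompt."""
    return "\n".join(f"{src} => {dst}" for src, dst in glossary.items())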

RTX 3060 12 GB

Sub-3s Latency

All three models coexist on a 12 GB GPU (~9.5 GB combined). Carefully tuned VRAM budget with configurable debounce and batch pacing.
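
Debounce here means holding off the expensive downstream call until the input has settled; a generic version looks like the sketch below (class and parameter names are illustrative).

python
# Generic debouncer: fire `fn` only after `delay` seconds of quiet.
import threading

class Debouncer:
    def __init__(self, delay: float, fn):
        self.delay, self.fn = delay, fn
        self._timer: threading.Timer | None = None

    def trigger(self, *args):
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.delay, self.fn, args)
        self._timer.start()

# e.g. translate only once transcription output pauses for 300 ms:
# debounced = Debouncer(0.3, translate_fn); debounced.trigger(text)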

Hardware

System Requirements

Vocal10n runs entirely offline on a single Windows machine.
No data leaves your system.

All models run locally. Zero cloud dependency.
Operating System: Windows 10 / 11
GPU: NVIDIA RTX 3060 12 GB or better
CUDA: Toolkit 12.x
Python: 3.11
VRAM Required: ~9.5 GB combined (all 3 models)
Disk Space: ~15 GB (models + dependencies)

Quickstart

Getting Started

Up and running in three steps.

1

Clone & Set Up Environments

Clone the repository and run the setup script. It creates two virtual environments automatically.

powershell
git clone https://github.com/itsLittleKevin/Vocal10n.git
cd Vocal10n
.\setup_env.ps1
2

Add Your Models

Download the model weights and place them in the matching subdirectories under models/ (the FasterWhisper model downloads itself on first run).

text
models/
├── llm/   ← Qwen3-4B-Instruct-2507.Q4_K_M.gguf
├── stt/   ← FasterWhisper large-v3-turbo (auto-downloaded)
└── tts/   ← GPT-SoVITS weights
3

Launch

Start the full pipeline with a single command. The UI will open and guide you through selecting your audio device.

powershell
.\start.ps1

Roadmap

What's coming next

Vocal10n is under active development. Contributions are welcome.

Full roadmap
  • Planned

    Training Pipeline

    Curate session output (SRT + WAV) into a labelled training dataset with an approve/reject flow.

  • Planned

    Remote Backend Split

    Point STT, LLM, and TTS at separate remote hosts. HTTP boundaries already exist for TTS and LLM.

  • Planned

    Headless / API Mode

    A non-Qt HTTP/WebSocket entry point so browsers, mobile clients, and OBS can share one backend.

  • Planned

    Multi-GPU Sharding

    Per-module device selection to pin Whisper, Qwen3, and SoVITS to different GPUs on 24 GB+ rigs.

  • Planned

    Containerisation

    Docker images for the main app and each TTS backend, with a Compose file to wire all services.