Real-time · Local · Voice-cloned

Speak in one language.
Hear another.

Vocal10n is an open-source, fully local speech translation pipeline.
STT → LLM translation → cloned-voice TTS, all in under 3 seconds on a single GPU.

<3s End-to-end latency
12 GB Single GPU
100% Local, no cloud

Pipeline

How It Works

Five stages, all running locally on your GPU: from microphone to cloned voice in under three seconds.

01

Speak

Your microphone captures audio at 16 kHz. Acoustic echo cancellation removes TTS playback before it reaches the recogniser.
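
As a rough illustration of this capture stage, the sketch below pulls 16 kHz mono blocks off a microphone with the sounddevice package. The queue and block size are assumptions for illustration; echo cancellation happens upstream of this in the real pipeline.

python
# Illustrative 16 kHz mono capture loop (not Vocal10n's actual capture code).
import queue

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000                      # pipeline input rate
blocks: "queue.Queue[np.ndarray]" = queue.Queue()

def on_audio(indata, frames, time_info, status):
    """Queue each captured block of mono float32 samples."""
    if status:
        print(status)
    blocks.put(indata[:, 0].copy())

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    blocksize=SAMPLE_RATE // 2, callback=on_audio):
    sd.sleep(5_000)                       # capture for ~5 s (milliseconds)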

02

Transcribe

FasterWhisper large-v3-turbo converts speech to text in real time using a 6.5-second sliding decode window.
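
A minimal faster-whisper call over one buffered window might look like this; the sliding-window management and filtering that Vocal10n layers on top are omitted.

python
# Sketch: transcribe one window of audio with faster-whisper.
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

def transcribe_window(samples: np.ndarray) -> tuple[str, str]:
    """samples: float32 mono audio at 16 kHz, e.g. the last 6.5 s."""
    segments, info = model.transcribe(samples, beam_size=1, vad_filter=True)
    return " ".join(seg.text.strip() for seg in segments), info.language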

03

Translate

Qwen3-4B (local GGUF via llama.cpp) corrects punctuation, injects glossary terms, and translates to the target language.
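
In llama-cpp-python terms, the translation call could be sketched as below; the prompt wording and sampling settings are assumptions, not the project's shipped defaults.

python
# Sketch: punctuation fix-up + translation via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="models/llm/Qwen3-4B-Instruct-2507.Q4_K_M.gguf",
            n_gpu_layers=-1, n_ctx=4096, verbose=False)

def translate(text: str, target_lang: str, glossary: str = "") -> str:
    out = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": f"Fix punctuation, then translate into {target_lang}. "
                        f"Prefer these glossary terms: {glossary}"},
            {"role": "user", "content": text},
        ],
        temperature=0.2, max_tokens=256,
    )
    return out["choices"][0]["message"]["content"].strip()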

04

Synthesise

GPT-SoVITS generates cloned-voice speech from a short reference clip. Qwen3-TTS is available as an alternative backend.
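
Since the TTS backend sits behind an HTTP boundary, a client call might look like the sketch below. The port, endpoint, and field names are illustrative placeholders, not GPT-SoVITS's documented API.

python
# Hypothetical request to a local GPT-SoVITS server; endpoint and fields are
# placeholders for illustration.
import requests

def synthesise(text: str, lang: str) -> bytes:
    resp = requests.post(
        "http://127.0.0.1:9880/tts",               # assumed local TTS endpoint
        json={
            "text": text,
            "text_lang": lang,
            "ref_audio_path": "ref/voice_5s.wav",  # short reference clip
            "prompt_lang": "en",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content                            # audio bytes from the server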

05

Output

Translated audio plays back. Live subtitles appear in OBS. Paired SRT and WAV files are saved to disk.
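
For the SRT side of that output, appending one cue per confirmed translation is straightforward; the helper below is a generic sketch, not the project's writer.

python
# Sketch: append one SRT cue per confirmed translation.
from pathlib import Path

def srt_timestamp(seconds: float) -> str:
    ms = int(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def append_cue(path: Path, index: int, start: float, end: float, text: str) -> None:
    cue = f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n"
    with path.open("a", encoding="utf-8") as f:
        f.write(cue)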

Capabilities

Everything in one pipeline

All components run locally; no data leaves your machine.

FasterWhisper

Real-time STT

FasterWhisper large-v3-turbo with a 6.5 s sliding decode window, hallucination filtering, phonetic dedup, and automatic language detection.
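
To give a flavour of the dedup step, a toy filter could drop a new hypothesis when it is nearly identical to the previous one. This sketch compares normalised spellings rather than true phonetics, and the threshold is an assumption.

python
# Toy near-duplicate filter; Vocal10n's phonetic dedup is assumed to be richer.
import difflib
import re

def normalise(text: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def is_duplicate(prev: str, new: str, threshold: float = 0.9) -> bool:
    ratio = difflib.SequenceMatcher(None, normalise(prev), normalise(new)).ratio()
    return ratio >= threshold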

Qwen3-4B

Local LLM Translation

Qwen3-4B-Instruct GGUF via llama.cpp, no internet required. An OpenAI-compatible HTTP API is supported as an opt-in alternative backend.
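
The opt-in HTTP backend means any OpenAI-compatible server (such as llama.cpp's llama-server) can stand in for the in-process model; the base URL and model name below are assumptions.

python
# Sketch: translation via an OpenAI-compatible local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3-4b-instruct",
    messages=[
        {"role": "system", "content": "Translate the user's text into Japanese."},
        {"role": "user", "content": "The stream starts in five minutes."},
    ],
)
print(resp.choices[0].message.content)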

GPT-SoVITS

Voice-Cloned TTS

GPT-SoVITS clones your voice from a short reference clip and speaks translated text back. Qwen3-TTS is available as an alternative.

Browser Source

OBS Integration

A built-in HTTP subtitle server streams live text to an OBS Browser Source. Partial and confirmed translations update in near real-time.
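
Conceptually, the subtitle server only has to expose the latest text over HTTP so a Browser Source can render it. The toy server below serves a single plain-text line on an assumed port; the real server also streams partial and confirmed updates.

python
# Toy subtitle endpoint for an OBS Browser Source (port is an assumption).
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

latest = {"text": ""}

class SubtitleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = latest["text"].encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):          # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 5000), SubtitleHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
latest["text"] = "Live subtitles appear here"  # updated by the pipeline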

Knowledge Base

Glossary & RAG

Inject domain-specific terms directly into the STT prompt and LLM context. Vector retrieval kicks in automatically for large glossaries.
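
Injection itself can be as simple as formatting the glossary into the prompts, as in this sketch; the inline-term cap standing in for the retrieval cut-over is an assumption.

python
# Sketch: push glossary terms into the STT prompt and the LLM system prompt.
GLOSSARY = {"Vocal10n": "Vocal10n", "GGUF": "GGUF"}
MAX_INLINE_TERMS = 50   # assumed point where vector retrieval would take over

def stt_initial_prompt(glossary: dict[str, str]) -> str:
    """Bias the recogniser toward domain terms by listing them."""
    return "Vocabulary: " + ", ".join(list(glossary)[:MAX_INLINE_TERMS]) + "."

def llm_glossary_block(glossary: dict[str, str]) -> str:
    """Render source => target pairs for the translation prompt."""
    return "\n".join(f"{src} => {dst}" for src, dst in glossary.items())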

RTX 3060 12 GB

Sub-3s Latency

All three models coexist on a 12 GB GPU (~9.5 GB combined). Carefully tuned VRAM budget with configurable debounce and batch pacing.
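
Debounce here means holding off the expensive downstream call until the input has settled; a generic version looks like the sketch below (class and parameter names are illustrative).

python
# Generic debouncer: fire `fn` only after `delay` seconds of quiet.
import threading

class Debouncer:
    def __init__(self, delay: float, fn):
        self.delay, self.fn = delay, fn
        self._timer: threading.Timer | None = None

    def trigger(self, *args):
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.delay, self.fn, args)
        self._timer.start()

# e.g. translate only once transcription output pauses for 300 ms:
# debounced = Debouncer(0.3, translate_fn); debounced.trigger(text)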

Hardware

System Requirements

Vocal10n runs entirely offline on a single Windows machine.
No data leaves your system.

All models run locally. Zero cloud dependency.
Operating System: Windows 10 / 11
GPU: NVIDIA RTX 3060 12 GB or better
CUDA: Toolkit 12.x
Python: 3.11
VRAM Required: ~9.5 GB combined (all 3 models)
Disk Space: ~15 GB (models + dependencies)

Quickstart

Getting Started

Up and running in three steps.

1

Clone & Set Up Environments

Clone the repository and run the setup script. It creates two virtual environments automatically.

powershell
git clone https://github.com/itsLittleKevin/Vocal10n.git
cd Vocal10n
.\setup_env.ps1
2

Add Your Models

Download the model weights and place them in the matching subdirectories under models/ (the FasterWhisper model downloads itself on first run).

text
models/
├── llm/   ← Qwen3-4B-Instruct-2507.Q4_K_M.gguf
├── stt/   ← FasterWhisper large-v3-turbo (auto-downloaded)
└── tts/   ← GPT-SoVITS weights
3

Launch

Start the full pipeline with a single command. The UI will open and guide you through selecting your audio device.

powershell
.\start.ps1

Roadmap

What's coming next

Vocal10n is under active development. Contributions are welcome.

Full roadmap
  • Planned

    Training Pipeline

    Curate session output (SRT + WAV) into a labelled training dataset with an approve/reject flow.

  • Planned

    Remote Backend Split

    Point STT, LLM, and TTS at separate remote hosts. HTTP boundaries already exist for TTS and LLM.

  • Planned

    Headless / API Mode

    A non-Qt HTTP/WebSocket entry point so browsers, mobile clients, and OBS can share one backend.

  • Planned

    Multi-GPU Sharding

    Per-module device selection to pin Whisper, Qwen3, and SoVITS to different GPUs on 24 GB+ rigs.

  • Planned

    Containerisation

    Docker images for the main app and each TTS backend, with a Compose file to wire all services.