18 — Roadmap and Open Items
Items below are derived from `simple_ui_plan.md`,
`simple_ui_validation.md`, deferred Phase markers in commit history, and
TODO-shaped gaps in the source.
Tracked
- Simple-mode validation matrix completion. `simple_ui_validation.md`
  lists scenarios still marked unchecked (“Execute validation matrix and
  final QA”). Run all combinations before tagging a release.
- Persisted Simple-mode preferences. Phase 3 of the Simple-mode plan
  calls for persisting the last-used Simple mode and selections; this is
  only partially implemented today.
Planned (Not Yet Finished)
Training Pipeline
The Training tab (`vocal10n.ui.tabs.training_tab`) and the
`output/training_data/` directory exist as placeholders; the actual
tooling is not yet wired in. Planned scope:
- Dataset curation from session output. Promote selected
  `*_source.srt` + matching WAV pairs (and target SRTs where useful)
  from `output/` into a labelled dataset under `training/`, with a
  per-clip approve / reject / edit flow in the Training tab.
- STT correction loop. Capture user edits to the source transcript and
  emit (audio, corrected text) pairs to feed back into the
  hallucination filter list, term files, and (eventually) Whisper
  fine-tuning.
- Glossary mining. Surface frequently mistranslated source spans so
  the user can promote them into `knowledge_base/glossary_general.txt`
  in one click.
- Voice-clone reference builder. Cut clean reference clips from
  recorded sessions to feed `reference_audio/` and the GPT-SoVITS /
  Qwen3-TTS reference paths.
- Optional fine-tuning hooks. Thin wrappers that hand a curated
  dataset to upstream training scripts (FasterWhisper / GPT-SoVITS) in
  their respective venvs. Training itself remains out-of-process.
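The first step of the curation flow above, pairing each `*_source.srt` in `output/` with its session WAV, could be sketched as below. The `<stem>_source.srt` / `<stem>.wav` naming convention is an assumption for illustration; the real session file layout may differ.

```python
from pathlib import Path


def find_dataset_candidates(output_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each *_source.srt with a WAV sharing the same stem.

    Only complete (srt, wav) pairs are returned; orphan SRTs are
    skipped so the Training tab never offers a clip it cannot play.
    """
    pairs = []
    for srt in sorted(output_dir.glob("*_source.srt")):
        wav = srt.with_name(srt.name.replace("_source.srt", "") + ".wav")
        if wav.exists():
            pairs.append((srt, wav))
    return pairs
```

The approve / reject / edit flow would then iterate over these pairs and copy approved items under `training/`, keeping `output/` itself untouched.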
Scalable Deployment
Today everything assumes one Windows workstation with one GPU. Planned:
- Remote backend split. Allow STT, LLM, and TTS to each point at a
  remote host instead of `127.0.0.1`. The HTTP boundaries already
  exist for TTS and (optionally) LLM; STT needs the same treatment.
- Multi-GPU sharding. Per-module device selection in
  `config/default.yaml` so a 24 GB-class machine can pin Whisper,
  Qwen3, and SoVITS to different GPUs explicitly.
- Containerisation. Reusable images for the main app and each TTS
  backend, leveraging the Dockerfiles already vendored under
  `vendor/GPT-SoVITS/`. A Compose file to wire the three services
  together.
- Headless / API mode. A non-Qt entry point that exposes the pipeline
  over HTTP / WebSocket so multiple thin clients (browser, OBS-only,
  mobile) can share one backend.
- Cross-platform packaging. Linux support for the headless mode first; full GUI parity later.
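The remote-backend split mostly reduces to making every module resolve its endpoint from config rather than hard-coding `127.0.0.1`. A sketch of that resolution step follows; the section names and default ports are illustrative assumptions, not the real keys in `config/default.yaml`.

```python
from dataclasses import dataclass


@dataclass
class BackendEndpoint:
    host: str = "127.0.0.1"
    port: int = 0

    @property
    def base_url(self) -> str:
        return f"http://{self.host}:{self.port}"


# Hypothetical default ports, one per module.
_DEFAULT_PORTS = {"stt": 9100, "llm": 8000, "tts": 9880}


def resolve_endpoints(cfg: dict) -> dict[str, BackendEndpoint]:
    """Build one endpoint per module, defaulting to localhost.

    A config section like {"tts": {"host": "10.0.0.5"}} is enough to
    move that one backend to another machine; unspecified modules keep
    running locally.
    """
    out = {}
    for name, default_port in _DEFAULT_PORTS.items():
        section = cfg.get(name, {})
        out[name] = BackendEndpoint(
            host=section.get("host", "127.0.0.1"),
            port=section.get("port", default_port),
        )
    return out
```

With this shape, the multi-GPU and containerised deployments fall out of the same mechanism: each service binds wherever it runs, and the workstation only needs the three host/port entries.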
Other Likely Improvements
- Configurable language pair list in Simple mode. Today it mirrors the
  `languages` config block; surfacing a smaller curated set would
  reduce decision load.
- Backend-aware warm-up budgets. Stage timeouts are set in code;
  exposing them in `config/default.yaml` would let users tune for
  slower hardware without editing source.
- AEC self-test. A short routine that plays a known signal and reports
  the converged echo-path estimate / residual would help users
  validate their setup.
- Output tab presets. “Subtitles only”, “Subtitles + audio”, and
  “Everything” presets that flip the five `output.*` flags together.
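The preset idea is essentially a named dictionary of flag states applied over the current config. The five flag names below are placeholders invented for this sketch; the real keys live under `output.*` in `config/default.yaml` and may differ.

```python
# Hypothetical output flag names, for illustration only.
_FLAGS = ("source_srt", "target_srt", "tts_wav", "session_log", "mixdown")

PRESETS = {
    "Subtitles only": {"source_srt": True, "target_srt": True,
                       "tts_wav": False, "session_log": False,
                       "mixdown": False},
    "Subtitles + audio": {"source_srt": True, "target_srt": True,
                          "tts_wav": True, "session_log": False,
                          "mixdown": False},
    "Everything": {flag: True for flag in _FLAGS},
}


def apply_preset(config: dict, name: str) -> dict:
    """Return a copy of config with all five output flags set per preset."""
    out = dict(config)
    out.update(PRESETS[name])
    return out
```

Because a preset always writes all five flags, the Output tab never ends up in a half-applied state when the user switches presets.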
Deferred (Out of Scope for Current Phase)
- Multi-source language auto-switching during a single session.
- Hot-swap of the LLM backend without losing translation context.
- Full Windows-parity GUI on macOS.
Notes for Future Contributors
- Keep the module-controller-tab boundary intact. Tabs should not
  call engines directly; they should call the controller, which talks
  to `SystemState` and the dispatcher.
- New backends (e.g. another TTS engine) should mirror the `qwen3_*`
  triple: `*_server.py`, `*_client.py`, `*_controller.py`, plus a tab
  module that the container tab can swap in.
- New event types must be added to `vocal10n.constants.EventType`, and
  the dispatcher must remain free of cycles. The current set forms a
  DAG: STT → Translation → TTS / Files / OBS.
- All long-running operations belong on a worker thread; the Qt event
  loop must remain responsive at all times. The Simple-mode staged
  startup is the canonical example.
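The no-cycles invariant on the dispatcher is cheap to enforce mechanically, e.g. in a unit test. Below is a sketch with an illustrative subset of event types (the real enum lives in `vocal10n.constants.EventType`) and a hypothetical emit-graph table; `has_cycle` is a standard three-colour depth-first search.

```python
from enum import Enum, auto


class EventType(Enum):
    """Illustrative subset only; not the real enum."""
    STT_RESULT = auto()
    TRANSLATION_RESULT = auto()
    TTS_RESULT = auto()
    FILE_WRITTEN = auto()
    OBS_UPDATE = auto()


# Which event types a handler may emit while handling another
# (STT → Translation → TTS / Files / OBS, per the DAG above).
EDGES = {
    EventType.STT_RESULT: {EventType.TRANSLATION_RESULT},
    EventType.TRANSLATION_RESULT: {EventType.TTS_RESULT,
                                   EventType.FILE_WRITTEN,
                                   EventType.OBS_UPDATE},
}


def has_cycle(edges: dict) -> bool:
    """Depth-first check that the emit graph stays a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = GREY          # on the current DFS path
        for nxt in edges.get(node, ()):
            if state.get(nxt, WHITE) == GREY:
                return True         # back-edge: cycle found
            if state.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        state[node] = BLACK         # fully explored
        return False

    return any(visit(n) for n in edges if state.get(n, WHITE) == WHITE)
```

Asserting `not has_cycle(EDGES)` in CI means a contributor who wires a new event type backwards (say, TTS re-emitting an STT event) gets a test failure instead of a runtime feedback loop.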