18 — Roadmap and Open Items
Items below are derived from `simple_ui_plan.md`,
`simple_ui_validation.md`, deferred Phase markers in commit history, and
TODO-shaped gaps in the source.
Tracked
- Simple-mode validation matrix completion. `simple_ui_validation.md`
  lists scenarios still marked unchecked (“Execute validation matrix and
  final QA”). Run all combinations before tagging a release.
- Persisted Simple-mode preferences. Phase 3 of the Simple-mode plan
  calls for persisting the last-used Simple mode and selections; this is
  only partially implemented today.
Planned (Not Yet Finished)
Training Pipeline
The Training tab (`vocal10n.ui.tabs.training_tab`) and the
`output/training_data/` directory exist as placeholders; the actual
tooling is not yet wired in. Planned scope:
- Dataset curation from session output. Promote selected
  `*_source.srt` + matching WAV pairs (and target SRTs where useful)
  from `output/` into a labelled dataset under `training/`, with a
  per-clip approve / reject / edit flow in the Training tab.
- STT correction loop. Capture user edits to the source transcript and
  emit (audio, corrected text) pairs to feed back into the
  hallucination filter list, term files, and (eventually) Whisper
  fine-tuning.
- Glossary mining. Surface frequently mistranslated source spans so
  the user can promote them into `knowledge_base/glossary_general.txt`
  in one click.
- Voice-clone reference builder. Cut clean reference clips from
  recorded sessions to feed `reference_audio/` and the GPT-SoVITS /
  Qwen3-TTS reference paths.
- Optional fine-tuning hooks. Thin wrappers that hand a curated
  dataset to upstream training scripts (FasterWhisper / GPT-SoVITS) in
  their respective venvs. Training itself remains out-of-process.
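The first step of the curation flow above, pairing each `*_source.srt` in `output/` with its session WAV, could be sketched as below. The `<stem>_source.srt` / `<stem>.wav` naming convention is an assumption for illustration; the real session file layout may differ.

```python
from pathlib import Path


def find_dataset_candidates(output_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each *_source.srt with a WAV sharing the same stem.

    Only complete (srt, wav) pairs are returned; orphan SRTs are
    skipped so the Training tab never offers a clip it cannot play.
    """
    pairs = []
    for srt in sorted(output_dir.glob("*_source.srt")):
        wav = srt.with_name(srt.name.replace("_source.srt", "") + ".wav")
        if wav.exists():
            pairs.append((srt, wav))
    return pairs
```

The approve / reject / edit flow would then iterate over these pairs and copy approved items under `training/`, keeping `output/` itself untouched.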
Scalable Deployment
Today everything assumes one Windows workstation with one GPU. Planned:
- Remote backend split. Allow STT, LLM, and TTS to each point at a
  remote host instead of `127.0.0.1`. The HTTP boundaries already
  exist for TTS and (optionally) LLM; STT needs the same treatment.
- Multi-GPU sharding. Per-module device selection in
  `config/default.yaml` so a 24 GB-class machine can pin Whisper,
  Qwen3, and SoVITS to different GPUs explicitly.
- Containerisation. Reusable images for the main app and each TTS
  backend, leveraging the Dockerfiles already vendored under
  `vendor/GPT-SoVITS/`. A Compose file to wire the three services
  together.
- Headless / API mode. A non-Qt entry point that exposes the pipeline
  over HTTP / WebSocket so multiple thin clients (browser, OBS-only,
  mobile) can share one backend.
- Cross-platform packaging. Linux support for the headless mode first; full GUI parity later.
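The remote-backend split mostly reduces to making every module resolve its endpoint from config rather than hard-coding `127.0.0.1`. A sketch of that resolution step follows; the section names and default ports are illustrative assumptions, not the real keys in `config/default.yaml`.

```python
from dataclasses import dataclass


@dataclass
class BackendEndpoint:
    host: str = "127.0.0.1"
    port: int = 0

    @property
    def base_url(self) -> str:
        return f"http://{self.host}:{self.port}"


# Hypothetical default ports, one per module.
_DEFAULT_PORTS = {"stt": 9100, "llm": 8000, "tts": 9880}


def resolve_endpoints(cfg: dict) -> dict[str, BackendEndpoint]:
    """Build one endpoint per module, defaulting to localhost.

    A config section like {"tts": {"host": "10.0.0.5"}} is enough to
    move that one backend to another machine; unspecified modules keep
    running locally.
    """
    out = {}
    for name, default_port in _DEFAULT_PORTS.items():
        section = cfg.get(name, {})
        out[name] = BackendEndpoint(
            host=section.get("host", "127.0.0.1"),
            port=section.get("port", default_port),
        )
    return out
```

With this shape, the multi-GPU and containerised deployments fall out of the same mechanism: each service binds wherever it runs, and the workstation only needs the three host/port entries.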
Other Likely Improvements
- Configurable language pair list in Simple mode. Today it mirrors the
  `languages` config block; surfacing a smaller curated set would
  reduce decision load.
- Backend-aware warm-up budgets. Stage timeouts are set in code;
  exposing them in `config/default.yaml` would let users tune for
  slower hardware without editing source.
- AEC self-test. A short routine that plays a known signal and reports
  the converged echo-path estimate / residual would help users
  validate their setup.
- Output tab presets. “Subtitles only”, “Subtitles + audio”, and
  “Everything” presets that flip the five `output.*` flags together.
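The preset idea is essentially a named dictionary of flag states applied over the current config. The five flag names below are placeholders invented for this sketch; the real keys live under `output.*` in `config/default.yaml` and may differ.

```python
# Hypothetical output flag names, for illustration only.
_FLAGS = ("source_srt", "target_srt", "tts_wav", "session_log", "mixdown")

PRESETS = {
    "Subtitles only": {"source_srt": True, "target_srt": True,
                       "tts_wav": False, "session_log": False,
                       "mixdown": False},
    "Subtitles + audio": {"source_srt": True, "target_srt": True,
                          "tts_wav": True, "session_log": False,
                          "mixdown": False},
    "Everything": {flag: True for flag in _FLAGS},
}


def apply_preset(config: dict, name: str) -> dict:
    """Return a copy of config with all five output flags set per preset."""
    out = dict(config)
    out.update(PRESETS[name])
    return out
```

Because a preset always writes all five flags, the Output tab never ends up in a half-applied state when the user switches presets.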
Deferred (Out of Scope for Current Phase)
- Multi-source language auto-switching during a single session.
- Hot-swap of the LLM backend without losing translation context.
- Full Windows-parity GUI on macOS.
Notes for Future Contributors
- Keep the module-controller-tab boundary intact. Tabs should not
  call engines directly; they should call the controller, which talks
  to `SystemState` and the dispatcher.
- New backends (e.g. another TTS engine) should mirror the `qwen3_*`
  triple: `*_server.py`, `*_client.py`, `*_controller.py`, plus a tab
  module that the container tab can swap in.
- New event types must be added to `vocal10n.constants.EventType`, and
  the dispatcher must remain free of cycles. The current set forms a
  DAG: STT → Translation → TTS / Files / OBS.
- All long-running operations belong on a worker thread; the Qt event
  loop must remain responsive at all times. The Simple-mode staged
  startup is the canonical example.
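The no-cycles invariant on the dispatcher is cheap to enforce mechanically, e.g. in a unit test. Below is a sketch with an illustrative subset of event types (the real enum lives in `vocal10n.constants.EventType`) and a hypothetical emit-graph table; `has_cycle` is a standard three-colour depth-first search.

```python
from enum import Enum, auto


class EventType(Enum):
    """Illustrative subset only; not the real enum."""
    STT_RESULT = auto()
    TRANSLATION_RESULT = auto()
    TTS_RESULT = auto()
    FILE_WRITTEN = auto()
    OBS_UPDATE = auto()


# Which event types a handler may emit while handling another
# (STT → Translation → TTS / Files / OBS, per the DAG above).
EDGES = {
    EventType.STT_RESULT: {EventType.TRANSLATION_RESULT},
    EventType.TRANSLATION_RESULT: {EventType.TTS_RESULT,
                                   EventType.FILE_WRITTEN,
                                   EventType.OBS_UPDATE},
}


def has_cycle(edges: dict) -> bool:
    """Depth-first check that the emit graph stays a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = GREY          # on the current DFS path
        for nxt in edges.get(node, ()):
            if state.get(nxt, WHITE) == GREY:
                return True         # back-edge: cycle found
            if state.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        state[node] = BLACK         # fully explored
        return False

    return any(visit(n) for n in edges if state.get(n, WHITE) == WHITE)
```

Asserting `not has_cycle(EDGES)` in CI means a contributor who wires a new event type backwards (say, TTS re-emitting an STT event) gets a test failure instead of a runtime feedback loop.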