18 — Roadmap and Open Items

Items below are derived from simple_ui_plan.md, simple_ui_validation.md, deferred Phase markers in commit history, and TODO-shaped gaps in the source.

Tracked

  • Simple-mode validation matrix completion. simple_ui_validation.md still lists unchecked scenarios (“Execute validation matrix and final QA”). Run all combinations before tagging a release.
  • Persisted Simple-mode preferences. Phase 3 of the Simple-mode plan calls for persisting the last-used Simple mode and selections; this is only partially implemented today.
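
A minimal sketch of what the persistence could look like, assuming a JSON file under a hypothetical ~/.vocal10n/ directory (the path, function names, and key names are illustrative, not the current implementation):

```python
import json
from pathlib import Path

# Hypothetical preferences location; the real app may use a different store.
PREFS_PATH = Path.home() / ".vocal10n" / "simple_mode.json"


def save_simple_prefs(mode: str, selections: dict) -> None:
    """Persist the last-used Simple mode and its selections."""
    PREFS_PATH.parent.mkdir(parents=True, exist_ok=True)
    PREFS_PATH.write_text(
        json.dumps({"mode": mode, "selections": selections}, indent=2)
    )


def load_simple_prefs() -> dict:
    """Return saved preferences, or an empty dict on first run / corrupt file."""
    try:
        return json.loads(PREFS_PATH.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```

Loading would happen once at Simple-mode startup; saving on every selection change keeps the file current without a shutdown hook.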

Planned (Not Yet Finished)

Training Pipeline

The Training tab (vocal10n.ui.tabs.training_tab) and output/training_data/ directory exist as placeholders; the actual tooling is not yet wired in. Planned scope:

  • Dataset curation from session output. Promote selected *_source.srt + matching WAV pairs (and target SRTs where useful) from output/ into a labelled dataset under training/, with a per-clip approve / reject / edit flow in the Training tab.
  • STT correction loop. Capture user edits to the source transcript and emit (audio, corrected text) pairs to feed back into the hallucination filter list, term files, and (eventually) Whisper fine-tuning.
  • Glossary mining. Surface frequently mistranslated source spans so the user can promote them into knowledge_base/glossary_general.txt in one click.
  • Voice-clone reference builder. Cut clean reference clips from recorded sessions to feed reference_audio/ and the GPT-SoVITS / Qwen3-TTS reference paths.
  • Optional fine-tuning hooks. Thin wrappers that hand a curated dataset to upstream training scripts (FasterWhisper / GPT-SoVITS) in their respective venvs. Training itself remains out-of-process.
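
The curation step above hinges on pairing each *_source.srt with its session WAV. A minimal sketch, assuming the WAV shares the SRT's stem (find_dataset_pairs and the exact naming convention are illustrative, not existing code):

```python
from pathlib import Path


def find_dataset_pairs(output_dir: str) -> list[tuple[Path, Path]]:
    """Pair each *_source.srt with a WAV sharing the same session stem.

    Assumes the convention <session>_source.srt next to <session>.wav;
    SRTs with no matching WAV are skipped rather than raising.
    """
    out = Path(output_dir)
    pairs: list[tuple[Path, Path]] = []
    for srt in sorted(out.glob("*_source.srt")):
        wav = srt.with_name(srt.name.replace("_source.srt", ".wav"))
        if wav.exists():
            pairs.append((srt, wav))
    return pairs
```

The Training tab's approve / reject / edit flow would then iterate over these pairs, copying approved ones under training/.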

Scalable Deployment

Today everything assumes one Windows workstation with one GPU. Planned:

  • Remote backend split. Allow STT, LLM, and TTS to each point at a remote host instead of 127.0.0.1. The HTTP boundaries already exist for TTS and (optionally) LLM — STT needs the same treatment.
  • Multi-GPU sharding. Per-module device selection in config/default.yaml so a 24 GB-class machine can pin Whisper, Qwen3 and SoVITS to different GPUs explicitly.
  • Containerisation. Reusable images for the main app and each TTS backend, leveraging the Dockerfiles already vendored under vendor/GPT-SoVITS/. Compose file to wire the three services together.
  • Headless / API mode. A non-Qt entry point that exposes the pipeline over HTTP / WebSocket so multiple thin clients (browser, OBS-only, mobile) can share one backend.
  • Cross-platform packaging. Linux support for the headless mode first; full GUI parity later.
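
The remote-split and multi-GPU items could land as a single config surface. A hypothetical sketch of what config/default.yaml might grow (the backends: block, key names, and ports are illustrative, not the current schema):

```yaml
# Illustrative only — not the current config/default.yaml schema.
backends:
  stt:
    host: 127.0.0.1   # point at a remote GPU box to offload Whisper
    port: 9871
    device: cuda:0
  llm:
    host: 127.0.0.1
    port: 9872
    device: cuda:1
  tts:
    host: 127.0.0.1
    port: 9873
    device: cuda:1
```

With host defaulting to 127.0.0.1 and device defaulting to the single-GPU case, existing single-workstation setups would keep working unchanged.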

Other Likely Improvements

  • Configurable language pair list in Simple mode. Today it mirrors the languages config block; surfacing a smaller curated set would reduce decision load.
  • Backend-aware warm-up budgets. Stage timeouts are set in code; exposing them in config/default.yaml would let users tune for slower hardware without editing source.
  • AEC self-test. A short routine that plays a known signal and reports the converged echo-path estimate / residual would help users validate their setup.
  • Output tab presets. “Subtitles only”, “Subtitles + audio”, “Everything” presets that flip the five output.* flags together.
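
The presets item could be as small as a lookup table that the Output tab applies in one call. A sketch under assumed flag names (OUTPUT_PRESETS and the five keys are illustrative stand-ins for the real output.* flags):

```python
# Hypothetical preset table; the five flag names are illustrative
# stand-ins for the real output.* keys in config/default.yaml.
OUTPUT_PRESETS = {
    "Subtitles only": {
        "source_srt": True, "target_srt": True,
        "source_wav": False, "target_wav": False, "obs_captions": False,
    },
    "Subtitles + audio": {
        "source_srt": True, "target_srt": True,
        "source_wav": True, "target_wav": True, "obs_captions": False,
    },
    "Everything": {
        k: True
        for k in ("source_srt", "target_srt",
                  "source_wav", "target_wav", "obs_captions")
    },
}


def apply_preset(config: dict, name: str) -> dict:
    """Return a copy of config with the named preset's flags applied."""
    updated = dict(config)
    updated.update(OUTPUT_PRESETS[name])
    return updated
```

Keeping the presets as plain data means the tab's checkboxes can still be toggled individually afterwards without fighting the preset.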

Deferred (Out of Scope for Current Phase)

  • Multi-source language auto-switching during a single session.
  • Hot-swap of the LLM backend without losing translation context.
  • Full Windows-parity GUI on macOS.

Notes for Future Contributors

  • Keep the module-controller-tab boundary intact. Tabs should not call engines directly; they should call the controller, which talks to SystemState and the dispatcher.
  • New backends (e.g. another TTS engine) should mirror the qwen3_* triple: *_server.py, *_client.py, *_controller.py, plus a tab module that the container tab can swap in.
  • New event types must be added to vocal10n.constants.EventType and the dispatcher must remain free of cycles. The current set forms a DAG: STT → Translation → TTS / Files / OBS.
  • All long-running operations belong on a worker thread; the Qt event loop must remain responsive at all times. The Simple-mode staged startup is the canonical example.
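
The no-cycles invariant on the dispatcher can be enforced mechanically, e.g. as a unit test over the registered event edges. A minimal sketch (has_cycle is illustrative, not existing code; the node names mirror the STT → Translation → TTS / Files / OBS flow above):

```python
def has_cycle(edges: dict[str, list[str]]) -> bool:
    """Depth-first search for a back edge in the dispatcher's event graph."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited / on current path / done
    state: dict[str, int] = {}

    def visit(node: str) -> bool:
        state[node] = GREY
        for nxt in edges.get(node, ()):
            colour = state.get(nxt, WHITE)
            if colour == GREY:          # back edge: nxt is on the current path
                return True
            if colour == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in edges)
```

Running this over the dispatcher's edge table whenever a new EventType is registered would catch an accidental cycle at startup rather than as a deadlock in production.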