07 — STT Module

Source: src/vocal10n/stt/.

Responsibilities

  • Capture microphone audio at 16 kHz mono.
  • Optionally cancel TTS playback echo before transcription.
  • Run FasterWhisper in a sliding-window streaming mode.
  • Emit partial (live) and confirmed (stable) text events.
  • Filter hallucinations and obvious repetitions.
  • Optionally tag speakers (lightweight diarisation).

Files

| File | Role |
| --- | --- |
| audio_capture.py | sounddevice input stream, ring buffer, device enumeration. |
| playback_aec.py | PlaybackTimeline and AdaptiveEchoCanceller (NLMS + DTD). |
| engine.py | STTEngine — thin FasterWhisper wrapper. |
| transcript.py | Segment management; partial vs. confirmed. |
| filters.py | Hallucination filter, adjacent-dedup, short-phrase repeat filter, phonetic correction. |
| diarizer.py | Optional speaker-id tagging. |
| worker.py | Thread that pulls from the ring buffer and drives the engine. |
| controller.py | Public API consumed by the STT tab. |

Streaming Strategy

The capture layer pushes 0.2 s frames into a ring buffer. The worker runs a loop that:

  1. Reads the trailing window_seconds (default 6.5) of audio.
  2. Calls STTEngine.transcribe(), which returns SegmentResult[] with text, start, end, avg_logprob, no_speech_prob, and per-word confidences when available.
  3. Passes each segment through filters.HallucinationFilter, then adjacent-dedup and short-phrase repeat suppression to cut runaway loops Whisper sometimes produces on noise.
  4. Splits segments into:
    • Pending: segments whose end is within confirm_threshold seconds of the window tail.
    • Confirmed: segments older than the threshold or older than max_segment_age (force-flush to avoid stalling on a long unsplit utterance).
  5. Publishes STT_PARTIAL for the pending tail and STT_CONFIRMED for each freshly-stable segment.
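The pending/confirmed split in step 4 can be sketched as follows. The SegmentResult fields and the three thresholds come from the text; the dataclass shape, the default values, and the helper name split_segments are illustrative assumptions, not the module's actual API:

```python
from dataclasses import dataclass

@dataclass
class SegmentResult:
    text: str
    start: float          # seconds, relative to the window start
    end: float
    avg_logprob: float = 0.0
    no_speech_prob: float = 0.0

def split_segments(segments, window_end,
                   confirm_threshold=1.5, max_segment_age=15.0):
    """Split segments into pending (still mutable) and confirmed (stable)."""
    pending, confirmed = [], []
    for seg in segments:
        tail_age = window_end - seg.end
        # Pending only if near the window tail AND not force-flushed by age.
        if tail_age < confirm_threshold and (window_end - seg.start) < max_segment_age:
            pending.append(seg)
        else:
            confirmed.append(seg)
    return pending, confirmed
```

A segment well inside the window is confirmed immediately; one ending near the tail stays pending until later windows push it past the threshold.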

Recognition context — the term files described in 14 — Knowledge Base and RAG — are concatenated into Whisper’s initial_prompt, up to initial_prompt_capacity terms.
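A hedged sketch of that concatenation, assuming one term per line in each file (the function name, file format, and truncation strategy are assumptions; only initial_prompt_capacity is from the text):

```python
def build_initial_prompt(term_files, capacity=120):
    """Concatenate domain terms into a Whisper initial_prompt, capped at `capacity` terms."""
    terms = []
    for path in term_files:
        with open(path, encoding="utf-8") as f:
            terms.extend(line.strip() for line in f if line.strip())
        if len(terms) >= capacity:
            break  # stop reading further files once the cap is reached
    return ", ".join(terms[:capacity])
```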

Hallucination Filter

filters.py carries multiple defences:

  • Static filter list. Phrases loaded from config/filters.txt (e.g. boilerplate Whisper hallucinations like “Thank you for watching”). Matched segments are dropped.
  • Adjacent dedup. A segment whose text is identical or near-identical to the previous confirmed segment is suppressed.
  • Short-phrase repeat suppression. Detects N-times-repeated short fragments (a known Whisper failure mode on silence).
  • Phonetic index (e.g. stt_terms/context_gaming.txt) — fuzzy-matches source tokens against expected domain terms to correct near-miss mishearings (“sounds like the term but isn’t quite it” cases).

The full filter list is editable from the Knowledge Base tab via vocal10n.ui.widgets.filter_list_editor.
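The first two defences (static list and adjacent dedup) can be sketched as follows. The lowercase normalisation, the difflib-based similarity, and the 0.9 threshold are assumptions for illustration, not the module's actual values:

```python
import difflib

class HallucinationFilter:
    """Drops known Whisper boilerplate and near-duplicate adjacent segments."""

    def __init__(self, static_phrases, dedup_ratio=0.9):
        self.static = {p.strip().lower() for p in static_phrases}
        self.dedup_ratio = dedup_ratio
        self._last_confirmed = ""

    def accept(self, text: str) -> bool:
        norm = text.strip().lower()
        if not norm or norm in self.static:
            return False  # empty or listed boilerplate hallucination
        sim = difflib.SequenceMatcher(None, norm, self._last_confirmed).ratio()
        if sim >= self.dedup_ratio:
            return False  # near-identical to the previous confirmed segment
        self._last_confirmed = norm
        return True
```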

Playback-Aware AEC

When TTS is enabled, the played-back synthetic voice can re-enter the microphone and be re-transcribed, producing an echo loop. The AEC layer in playback_aec.py prevents this:

```mermaid
flowchart LR
    Mic([mic]) --> AEC["AdaptiveEchoCanceller<br/>(NLMS, length filter_taps)"]
    TTSPlay([TTS playback]) --> Timeline[PlaybackTimeline]
    Timeline -->|reference signal| AEC
    AEC --> DTD{"Double-talk?<br/>mic > dt_threshold &times; echo"}
    DTD -- yes --> Freeze["Freeze NLMS weights<br/>(still apply filter)"]
    DTD -- no --> Adapt[Adapt NLMS weights]
    Freeze --> Out([clean mic to Whisper])
    Adapt --> Out
```
  • PlaybackTimeline is a thread-safe ring buffer of recently played audio, time-tagged at the moment sd.play() is called by AudioPlayer.
  • AdaptiveEchoCanceller runs per mic chunk:
    1. Looks up the matching reference signal from PlaybackTimeline, allowing for a bulk acoustic delay estimated periodically by cross-correlation (capped at aec.max_delay_ms).
    2. Applies a block-NLMS adaptive filter of length aec.filter_taps. Adaptation step is aec.step_size.
    3. Runs a double-talk detector: if mic energy exceeds aec.dt_threshold × estimated-echo energy, weight adaptation is frozen while the existing filter is still applied. This keeps user speech intact while preventing the adapter from drifting during overlap.
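A toy per-sample sketch of the NLMS update with double-talk freezing. The parameter names mirror the config keys above; everything else is an assumption — the real module works on blocks, and this sketch compares mic power to reference power (rather than to the adapted echo estimate) purely so the detector does not freeze while the filter is still untrained:

```python
import numpy as np

def nlms_step(weights, x, d, step_size=0.5, dt_threshold=2.0, eps=1e-8):
    """One NLMS sample.
    x: most-recent reference (playback) samples, newest first, len == len(weights)
    d: current mic sample (echo plus any near-end speech)
    Returns (updated weights, cleaned output sample).
    """
    y = float(weights @ x)    # estimated echo
    e = d - y                 # cleaned output sent to Whisper
    # Simplified double-talk detector: mic power vs. mean reference power.
    if d * d > dt_threshold * (float(x @ x) / len(x) + eps):
        return weights, e     # freeze adaptation; still apply the filter
    weights = weights + step_size * e * x / (float(x @ x) + eps)
    return weights, e
```

On a pure-echo signal the weights converge to the true echo path and the residual error decays toward zero, which is exactly the behaviour the DTD freeze is protecting during overlapped speech.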

This was added in commit f334773 after several iterations on simpler correlation-based subtraction proved insufficient for typical desktop rooms.

Speaker Diarisation

diarizer.py provides a lightweight per-segment speaker tagger that prefixes confirmed segments with [S1], [S2], … when SystemState.speaker_tagging is on. It is intentionally simple — no embedding model — and is meant for streaming contexts rather than post-hoc analysis.