# 07 — STT Module

Source: `src/vocal10n/stt/`.
## Responsibilities
- Capture microphone audio at 16 kHz mono.
- Optionally cancel TTS playback echo before transcription.
- Run FasterWhisper in a sliding-window streaming mode.
- Emit partial (live) and confirmed (stable) text events.
- Filter hallucinations and obvious repetitions.
- Optionally tag speakers (lightweight diarisation).
## Files

| File | Role |
|---|---|
| `audio_capture.py` | `sounddevice` input stream, ring buffer, device enumeration. |
| `playback_aec.py` | `PlaybackTimeline` and `AdaptiveEchoCanceller` (NLMS + DTD). |
| `engine.py` | `STTEngine` — thin FasterWhisper wrapper. |
| `transcript.py` | Segment management; partial vs. confirmed. |
| `filters.py` | Hallucination filter, adjacent dedup, short-phrase repeat filter, phonetic correction. |
| `diarizer.py` | Optional speaker-id tagging. |
| `worker.py` | Thread that pulls from the ring buffer and drives the engine. |
| `controller.py` | Public API consumed by the STT tab. |
## Streaming Strategy

The capture layer pushes 0.2 s frames into a ring buffer. The worker runs a loop that:

- Reads the trailing `window_seconds` (default 6.5) of audio.
- Calls `STTEngine.transcribe()`. Returns `SegmentResult[]` with text, start, end, `avg_logprob`, `no_speech_prob`, and per-word confidences when available.
- Passes each segment through `filters.HallucinationFilter`, then adjacent dedup and short-phrase repeat suppression, to cut the runaway loops Whisper sometimes produces on noise.
- Splits segments into:
  - Pending: segments whose `end` is within `confirm_threshold` seconds of the window tail.
  - Confirmed: segments older than the threshold, or older than `max_segment_age` (force-flushed to avoid stalling on a long unsplit utterance).
- Publishes `STT_PARTIAL` for the pending tail and `STT_CONFIRMED` for each freshly-stable segment.
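The pending/confirmed split above can be sketched as a pure function. This is a minimal illustration, not the project's code: the `Seg` type and the exact force-flush rule (age measured from segment `start`) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Seg:
    text: str
    start: float  # seconds since stream start
    end: float

def split_segments(segs, window_tail, confirm_threshold, max_segment_age):
    """Split segments into pending (may still change on the next window)
    and confirmed (stable, or force-flushed because they are too old)."""
    pending, confirmed = [], []
    for s in segs:
        stable = (window_tail - s.end) >= confirm_threshold
        # Assumed force-flush rule: a segment whose *start* is older than
        # max_segment_age is confirmed even if its tail is still moving.
        too_old = (window_tail - s.start) >= max_segment_age
        (confirmed if stable or too_old else pending).append(s)
    return pending, confirmed
```

The worker would then publish `STT_CONFIRMED` for each newly confirmed segment and `STT_PARTIAL` for the concatenated pending tail.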
Recognition context — the term files described in *14 — Knowledge Base and RAG* — is concatenated into Whisper's `initial_prompt`, up to `initial_prompt_capacity` terms.
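A sketch of how that concatenation might look. The helper name and the term-file format (one term per line, `#` comments) are assumptions for illustration only:

```python
def build_initial_prompt(term_files, capacity):
    """Concatenate domain terms from the given files into a single
    initial_prompt string, stopping after `capacity` terms."""
    terms = []
    for path in term_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                term = line.strip()
                if term and not term.startswith("#"):  # skip blanks/comments
                    terms.append(term)
                if len(terms) >= capacity:
                    return ", ".join(terms)
    return ", ".join(terms)
```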
## Hallucination Filter

`filters.py` carries multiple defences:

- Static filter list. Phrases loaded from `config/filters.txt` (e.g. boilerplate Whisper hallucinations like "Thank you for watching"). Matched segments are dropped.
- Adjacent dedup. A segment whose text is identical or near-identical to the previous confirmed segment is suppressed.
- Short-phrase repeat suppression. Detects N-times-repeated short fragments (a known Whisper failure mode on silence).
- Phonetic index (e.g. `stt_terms/context_gaming.txt`) — fuzzy match of source tokens against expected domain terms to fix "heard-but-misheard" cases, where a term is picked up as a similar-sounding word.
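The short-phrase repeat suppression can be sketched as a small predicate. Names and defaults (`min_repeats`, `max_fragment_words`) are illustrative, not the actual `filters.py` API:

```python
import re

def is_runaway_repeat(text, min_repeats=3, max_fragment_words=4):
    """True if the segment is one short fragment repeated back-to-back
    at least `min_repeats` times — the classic Whisper loop on silence."""
    words = re.findall(r"\w+", text.lower())
    for n in range(1, max_fragment_words + 1):
        if len(words) >= n * min_repeats and len(words) % n == 0:
            frag = words[:n]
            # Does the fragment tile the whole segment?
            if all(words[i:i + n] == frag for i in range(0, len(words), n)):
                return True
    return False
```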
The full filter list is editable from the Knowledge Base tab via `vocal10n.ui.widgets.filter_list_editor`.
## Playback-Aware AEC

When TTS is enabled, the played-back synthetic voice can re-enter the microphone and be re-transcribed, producing an echo loop. The AEC layer in `playback_aec.py` prevents this:
```mermaid
flowchart LR
    Mic([mic]) --> AEC["AdaptiveEchoCanceller<br/>(NLMS, length filter_taps)"]
    TTSPlay([TTS playback]) --> Timeline[PlaybackTimeline]
    Timeline -->|reference signal| AEC
    AEC --> DTD{"Double-talk?<br/>mic > dt_threshold × echo"}
    DTD -- yes --> Freeze["Freeze NLMS weights<br/>(still apply filter)"]
    DTD -- no --> Adapt[Adapt NLMS weights]
    Freeze --> Out([clean mic to Whisper])
    Adapt --> Out
```
`PlaybackTimeline` is a thread-safe ring buffer of recently played audio, time-tagged at the moment `sd.play()` is called by `AudioPlayer`. `AdaptiveEchoCanceller` runs per mic chunk:

- Looks up the matching reference signal from `PlaybackTimeline`, allowing for a bulk acoustic delay estimated periodically by cross-correlation (capped at `aec.max_delay_ms`).
- Applies a block-NLMS adaptive filter of length `aec.filter_taps`. The adaptation step is `aec.step_size`.
- Runs a double-talk detector: if mic energy exceeds `aec.dt_threshold` × estimated-echo energy, weight adaptation is frozen while the existing filter is still applied. This keeps user speech intact while preventing the adapter from drifting during overlap.
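A minimal sample-by-sample sketch of the NLMS + double-talk idea. This is not the project's `AdaptiveEchoCanceller` (which works block-wise and estimates delay); the parameter names mirror the `aec.*` settings above, and the detector here is Geigel-style, comparing the mic sample against recent *reference* amplitude rather than the echo estimate, to avoid a cold-start deadlock (the echo estimate is zero before any adaptation):

```python
import numpy as np

class NlmsEchoCanceller:
    """Per-sample NLMS echo canceller with a naive double-talk detector."""

    def __init__(self, taps=256, step=0.1, dt_threshold=2.0, eps=1e-8):
        self.w = np.zeros(taps)   # adaptive FIR weights (echo-path estimate)
        self.step = step
        self.dt_threshold = dt_threshold
        self.eps = eps

    def process(self, mic, ref):
        """Subtract the estimated echo of `ref` from `mic`."""
        taps = len(self.w)
        x = np.zeros(taps)        # delay line of reference samples
        out = np.empty_like(mic)
        for i in range(len(mic)):
            x = np.roll(x, 1)
            x[0] = ref[i]
            echo = self.w @ x     # estimated echo
            e = mic[i] - echo     # echo-cancelled sample
            out[i] = e
            # Double-talk check: a mic sample far louder than the recent
            # reference likely contains near-end speech -> freeze weights
            # (the filter is still applied, only adaptation stops).
            if abs(mic[i]) <= self.dt_threshold * (np.max(np.abs(x)) + self.eps):
                self.w += self.step * e * x / (x @ x + self.eps)
        return out
```

On a synthetic echo path (no near-end speech) the residual after convergence is orders of magnitude below the echo, which is the behaviour the flowchart's "clean mic to Whisper" output relies on.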
This was added in commit f334773 after several iterations, once simpler correlation-based subtraction proved insufficient for typical desktop rooms.
## Speaker Diarisation

`diarizer.py` provides a lightweight per-segment speaker tagger that prefixes confirmed segments with `[S1]`, `[S2]`, … when `SystemState.speaker_tagging` is on. It is intentionally simple — no embedding model — and is meant for streaming contexts rather than post-hoc analysis.
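The document does not describe the actual heuristic, so the following is a purely illustrative guess at what an embedding-free streaming tagger can look like: assume a possible speaker change whenever the pause between confirmed segments exceeds a threshold, cycling through a fixed set of labels. `GapSpeakerTagger`, `gap_s`, and `max_speakers` are all hypothetical names.

```python
class GapSpeakerTagger:
    """Illustrative only: tag segments by pause length between them.
    The real diarizer.py heuristic may differ."""

    def __init__(self, gap_s=1.5, max_speakers=2):
        self.gap_s = gap_s
        self.max_speakers = max_speakers
        self.current = 0       # index of the current speaker label
        self.last_end = None   # end time of the previous segment

    def tag(self, text, start, end):
        # A long enough silence between segments hints at a turn change.
        if self.last_end is not None and start - self.last_end > self.gap_s:
            self.current = (self.current + 1) % self.max_speakers
        self.last_end = end
        return f"[S{self.current + 1}] {text}"
```

This trades accuracy for zero model cost, which matches the stated design goal of streaming-friendly tagging rather than post-hoc analysis.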