Audio transcription
On-device speech-to-text, capture windows, multi-device inputs, voice activity detection, noise handling, diarisation, and transcript-only storage.
Last updated: 2 April 2026
What audio transcription does
Overshow captures speech from your chosen inputs and turns it into searchable transcript text on your machine. When you search or review a meeting, you are retrieving what was said as found in those transcripts, not text invented by the product. The privacy benefit is straightforward: the bundled speech model runs on-device, with no cloud transcription step for this core flow and no model download gate for the default engine.
Audio complements screen capture: you can find moments by spoken phrases, speaker labels (when identification is enabled), and time range alongside on-screen context.
Audio files are not persisted. Only transcript text and related metadata (timing, device, speaker labels where available) are stored. That matches the screen side’s “text and metadata only” posture.
The bundled speech model
Overshow ships a high-quality Whisper-based model with the application. That means:
- No separate download is required to start transcribing after install.
- No external API calls are needed for the default transcription path.
- Processing stays on your device, subject to platform capabilities.
Model updates ship with app updates. Your organisation controls upgrade cadence via its normal desktop software policy.
Capture windows
Transcription runs on fixed capture windows rather than open-ended streaming. Each window includes a small overlap with the next so words spanning a boundary are not cut off.
Why windows instead of arbitrary streaming?
Fixed windows stabilise resource usage, simplify alignment with storage and search indexes, and make it easier to apply downstream steps (voice activity detection, noise reduction, speaker separation) on consistent segments.
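The windowing scheme can be sketched in a few lines. The window and overlap lengths below are illustrative placeholders, since the product's actual defaults are internal:

```python
def windows(samples, window_s=30.0, overlap_s=2.0, rate=16000):
    """Split a sample buffer into fixed-length windows that share a
    small overlap, so a word straddling a boundary appears whole in
    at least one window. Sizes here are hypothetical defaults."""
    win = int(window_s * rate)                 # samples per window
    hop = int((window_s - overlap_s) * rate)   # stride between window starts
    out = []
    for start in range(0, len(samples), hop):
        chunk = samples[start:start + win]
        if chunk:
            out.append((start / rate, chunk))  # (start time in seconds, audio)
        if start + win >= len(samples):
            break
    return out
```

Because each window's tail is repeated at the head of the next, downstream stages can simply drop duplicate words at the seam instead of reconstructing split ones.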
Languages and automatic detection
The speech model supports 70+ languages with automatic language detection per segment or window. You do not need to pre-select a locale for many multilingual workflows; accuracy still improves when the dominant language matches your meetings.
Heavy code-switching, strong accents, or domain jargon may produce imperfect transcripts. Search supports partial phrase matches and, where enabled, semantic retrieval over embedded text, which helps when literal wording drifts.
Multi-device capture and device management
Overshow can capture from multiple audio inputs simultaneously, for example a headset and a room mic, or interface inputs on a podcast setup. Hot-swap support lets you plug and unplug devices during a session; the stack polls for reconnection so inputs return without a full app restart in typical cases.
Follow system default tracks the OS default input: when the default changes (Bluetooth disconnect, sleep/wake, plug/unplug), capture switches to the new default so you are not stuck on a stale device after an OS-level switch.
Practical multi-device setups
| Scenario | Suggestion |
|---|---|
| Video calls with music in-room | Capture meeting mic only; exclude unused inputs to reduce bleed |
| Dictation plus system sounds | Prefer headset input; avoid “loopback” unless you intend to index it |
| Interface + lavalier | Enable both inputs; name speakers after diarisation stabilises |
Hot-swap is designed for brief disconnects. If an interface powers down for minutes, expect a short gap until the poll rebinds the stream.
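The reconnection polling behind hot-swap can be sketched as a short retry loop. `enumerate_devices` and `open_stream` below stand in for the platform audio API; the names and poll interval are illustrative, not Overshow internals:

```python
import time

def rebind_loop(enumerate_devices, open_stream, wanted_id,
                poll_s=1.0, max_polls=5):
    """Poll for a disconnected device to reappear, then rebind its
    stream. If it never returns, give up so the caller can fall back
    to the OS default input."""
    for _ in range(max_polls):
        if wanted_id in enumerate_devices():
            return open_stream(wanted_id)   # device is back: rebind
        time.sleep(poll_s)                  # brief gap between polls
    return None                             # caller handles fallback
```

This is why a brief unplug heals silently while a multi-minute power-down shows a gap: transcription resumes only once a poll finds the device again.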
Audio processing pipeline
Incoming audio is processed through a defined chain before the speech model runs.
Audio pipeline stages
| Stage | What happens |
|---|---|
| Ingest | Audio from selected device(s), with hot-swap and default-follow behaviour |
| Normalisation | Level normalisation for consistent loudness |
| Decode / resample | Audio decoding and resampling for the speech model |
| VAD | Voice activity detection to focus effort on speech-bearing regions |
| Noise reduction | Noise reduction on detected speech segments |
| STT | On-device speech model decoding to text |
| Post-text | PII removal on transcript text before persistence |
| Optional diarisation | Speaker separation for speaker segments (see below) |
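The staged chain in the table can be sketched as a fold over stage callables, where any stage may drop a window (as VAD does for silence). The stage functions in the usage example are toy stand-ins, not Overshow internals:

```python
def transcribe_window(samples, stages):
    """Run one capture window through an ordered chain of stages.
    Each stage transforms its input; a stage returning None drops the
    window entirely (e.g. VAD finding no speech)."""
    for stage in stages:
        samples = stage(samples)
        if samples is None:
            return None     # nothing to persist for this window
    return samples          # final stage output (transcript text)
```

Toy usage, mirroring the normalise → VAD → STT portion of the table:

```python
stages = [
    lambda x: [s / 2 for s in x],                            # normalise (toy)
    lambda x: x if max(abs(s) for s in x) > 0.1 else None,   # VAD (toy)
    lambda x: "text",                                        # STT stub
]
```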
Noise reduction
Noise reduction estimates a noise floor and attenuates it on segments already flagged as speech by voice activity detection. It is not a substitute for a quiet room, but it steadies levels when fans, HVAC, or laptop hiss sit behind spoken content. Processing remains on-device; no audio leaves the machine for cloud denoising.
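A minimal time-domain sketch of "estimate a floor, attenuate below it" looks like the following; real denoisers work spectrally, and the threshold and attenuation factor here are purely illustrative:

```python
def reduce_noise(speech, noise_floor, atten=0.2):
    """Attenuate content at or below an estimated noise floor on a
    segment already flagged as speech by VAD. Samples clearly above
    the floor pass through unchanged."""
    return [s if abs(s) > noise_floor else s * atten for s in speech]
```

Because this runs only on VAD-flagged segments, steady hum between utterances never reaches it at all; it only cleans the residue sitting under actual speech.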
Why resampling?
Speech models are optimised for specific audio parameters. Overshow resamples input automatically while keeping dimensions predictable for the runtime. Higher sample rates from hardware are downsampled after decode.
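The downsampling step can be illustrated with a linear-interpolation resampler. Production code would use a proper polyphase filter, and the common 16 kHz speech-model rate is an assumption here, not a documented Overshow parameter:

```python
def resample(samples, src_rate, dst_rate):
    """Resample by linear interpolation: for each output sample, find
    its fractional position in the source and blend the two nearest
    source samples. A sketch only; real resamplers filter first."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # position in source samples
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```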
PII removal runs on transcript text before it is stored. Treat it as a safety net alongside pause/resume and organisational policy, not as sole protection for regulated content.
Voice activity detection (VAD)
Two voice activity detection modes are available:
| Mode | Character |
|---|---|
| Neural | Higher accuracy for tricky backgrounds |
| Classical | Fast and CPU-friendly |
Voice activity detection prevents silence and noise from flooding the recogniser and pairs with noise reduction on speech segments.
VAD sensitivity levels
| Level | When to use |
|---|---|
| High | Noisy environments; avoid transcribing constant background hum as speech |
| Medium | Default balance for typical office and home |
| Low | Quiet rooms or very sparse speech where you do not want short utterances skipped |
Raise sensitivity when false speech triggers are common; lower it when short or quiet phrases are missing from transcripts.
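The classical mode's behaviour can be sketched as an energy threshold per sensitivity level. The threshold values are hypothetical; the real levels are tuned internally:

```python
import math

# Illustrative thresholds only: High demands more energy to count a
# frame as speech, so background hum is rejected; Low accepts quieter
# frames, so short or faint utterances are kept.
THRESHOLDS = {"high": 0.05, "medium": 0.02, "low": 0.005}

def is_speech(frame, level="medium"):
    """Classical energy-based VAD sketch: a frame is speech if its
    RMS exceeds the threshold for the chosen sensitivity level."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > THRESHOLDS[level]
```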
Speaker diarisation and identification
Speaker diarisation (who spoke when) uses on-device models to cluster and separate speakers within captured audio. Naming and merging profiles, which let search filter by person, are covered on the identification page.
For labelling workflows, profile hygiene, and filters, see Speaker identification.
Diarisation quality rises with cleaner input, less overlap, and consistent device choice across sessions.
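Clustering of speaker segments can be illustrated with a greedy cosine-similarity pass over per-segment voice embeddings. This is a teaching sketch, far simpler than real diarisation, and the threshold is an arbitrary assumption:

```python
def cluster_speakers(embeddings, threshold=0.8):
    """Greedy clustering sketch: assign each segment embedding to the
    first existing speaker centroid it resembles (cosine similarity at
    or above the threshold), otherwise start a new speaker."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    centroids, labels = [], []
    for emb in embeddings:
        for i, c in enumerate(centroids):
            if cos(emb, c) >= threshold:
                labels.append(i)            # matches an existing speaker
                break
        else:
            centroids.append(emb)           # unseen voice: new speaker
            labels.append(len(centroids) - 1)
    return labels
```

The sketch also shows why clean input and consistent devices help: noisy or device-shifted embeddings drift away from their centroid and spawn spurious "new" speakers.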
What is stored
| Stored | Not stored |
|---|---|
| Transcript text per window/segment | WAV, MP3, or other raw audio archives |
| Timestamps and device metadata | Long-form recordings for replay |
| Speaker labels when diarisation/ID enabled | Audio blobs for external replay |
This transcript-only posture reduces disk use and aligns with privacy expectations for many organisations.
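The transcript-only record can be pictured as a small text-plus-metadata structure; the field names here are hypothetical, not Overshow's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TranscriptSegment:
    """Illustrative shape of what persists per segment: text and
    metadata only, never raw audio. All names are hypothetical."""
    text: str                     # what was said in this segment
    start_s: float                # segment start time, seconds
    end_s: float                  # segment end time, seconds
    device: str                   # input device the audio came from
    speaker: Optional[str] = None # set only when diarisation/ID is on
```

Note there is no field that could hold audio bytes: the "not stored" column is enforced by the shape of the record, not by a cleanup job.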
Configuration in Settings → Recording
Audio options, including inputs, VAD choice and sensitivity, and related toggles, live under Settings → Recording alongside screen controls. Some changes share the relaunch rules documented for monitors and ignored windows on the screen capture page; device hot-swap covers many runtime changes without a restart.
Supported configurations
| Area | Supported behaviour |
|---|---|
| Inputs | Multiple simultaneous devices; selection per device |
| Default device follow | Tracks OS default across plug/unplug, sleep/wake, Bluetooth |
| Hot-swap | Automatic reconnection polling |
| Stride / overlap | System defaults; not per-user sliders in core UI |
| VAD | Neural or classical; High / Medium / Low sensitivity |
| STT model | Bundled on-device speech model |
| Languages | 70+ with automatic detection |
| Speaker separation | On-device diarisation; labels via speaker identification |
| Compute | Uses local CPU/GPU per platform build; heavy sessions may increase fan noise, which is expected |
Performance and resource use
Transcription is continuous while capture runs. Longer strides (within the product’s fixed defaults) reduce wake-ups but are not user-tunable in the core UI: Overshow balances latency, accuracy, and battery for typical laptops. If thermals spike during long calls, close unused GPU-heavy apps; the speech model competes for the same power budget as your meeting client.
Troubleshooting common audio issues
| Symptom | Things to check |
|---|---|
| No transcript | OS microphone permission; correct input selected; not paused; default-follow vs explicit device |
| Gaps at sentence joins | Normal boundary behaviour; overlap mitigates most cases, and extreme cross-boundary words are rare edge cases |
| Too much noise transcribed | Raise VAD sensitivity to High; reduce room noise; move closer to mic |
| Short words missing | Lower VAD sensitivity to Low; confirm normalisation is not clipping extremely quiet speech |
| Wrong microphone after dock/undock | Follow system default enabled; or re-select input in settings |
| Bluetooth dropouts | OS reconnect timing and the hot-swap poll; wait briefly after reconnect |
| “Unknown” speakers | See Speaker identification: merge duplicates and name speakers after clear utterances |
| Transcript language wrong | Rare auto-detect edge cases; dominant speaker language usually wins; heavy mixing may confuse detection |
| Repeated partial words at window edges | Overlap should absorb most cases; if persistent, note the timestamp and file feedback, as it may indicate unusually clipped device buffers |
| Echo of your own voice | Use headphones; disable secondary mic capturing room playback |
| USB hub power sag | Try direct port on laptop; unstable power resets interfaces mid-call |
When to pause instead of tuning
If a conversation must not be transcribed under policy, pause capture. No amount of VAD tuning substitutes for a deliberate pause during confidential segments.
Related documentation
- Screen capture for how on-screen text aligns with transcripts in search.
- Speaker identification for profiles, merging, and filters.
- Privacy: on-device processing for how local processing fits your trust model.