Audio transcription

On-device speech-to-text, capture windows, multi-device inputs, voice activity detection, noise handling, diarisation, and transcript-only storage.

Last updated: 2 April 2026

What audio transcription does

Overshow captures speech from your chosen inputs and turns it into searchable transcript text on your machine. When you search or review a meeting, you are retrieving what was said as found in those transcripts, not text invented by the product. The privacy benefit is straightforward: the bundled speech model runs on-device, with no cloud transcription step for this core flow and no model download gate for the default engine.

Audio waveform with transcript output

Audio complements screen capture: you can find moments by spoken phrases, speaker labels (when identification is enabled), and time range alongside on-screen context.

Audio files are not persisted. Only transcript text and related metadata (timing, device, speaker labels where available) are stored. That matches the screen side’s “text and metadata only” posture.

The bundled speech model

Overshow ships a high-quality Whisper-based model with the application. That means:

  • No separate download is required to start transcribing after install.
  • No external API calls are needed for the default transcription path.
  • Processing stays on your device, subject to platform capabilities.

Model updates ship with app updates. Your organisation controls upgrade cadence via its normal desktop software policy.

Capture windows

Transcription runs on fixed capture windows rather than a separate “realtime only” toggle path. Each window includes a small overlap with the next so words spanning a boundary are not cut off.

Why windows instead of arbitrary streaming?

Fixed windows stabilise resource usage, simplify alignment with storage and search indexes, and make it easier to apply downstream steps (voice activity detection, noise reduction, speaker separation) on consistent segments.
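The boundary-overlap scheme can be sketched as follows. The 30-second window and 2-second overlap are illustrative values, not Overshow's actual defaults:

```python
def window_spans(total_s: float, window_s: float = 30.0, overlap_s: float = 2.0):
    """Yield (start, end) spans covering total_s seconds of audio.

    Each window overlaps the next by overlap_s, so a word that straddles
    a boundary appears in full in at least one window.
    """
    stride = window_s - overlap_s
    start = 0.0
    while start < total_s:
        yield (start, min(start + window_s, total_s))
        start += stride

# 70 seconds of audio becomes three windows; consecutive
# windows share a 2-second overlap region.
spans = list(window_spans(70.0))
```

The fixed stride is what keeps resource usage and index alignment predictable: every window except possibly the last has the same length.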

Languages and automatic detection

The speech model supports 70+ languages with automatic language detection per segment or window. You do not need to pre-select a locale for many multilingual workflows, though accuracy still improves when the dominant language matches your meetings.

Heavy code-switching, strong accents, or domain jargon may produce imperfect transcripts. Search supports partial phrase matches and, where enabled, semantic retrieval over embedded text, which is useful when literal wording drifts.
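Per-window detection can be sketched as picking the most probable language from the model's scores. The confidence fallback and threshold below are assumptions for illustration, not documented Overshow behaviour:

```python
def detect_language(probs: dict, min_confidence: float = 0.5) -> str:
    """Pick the highest-probability language code for a window.

    probs maps language codes to model scores. Falls back to "und"
    (undetermined) when no language is confident enough -- an assumed
    safeguard, not a documented product behaviour.
    """
    lang = max(probs, key=probs.get)
    return lang if probs[lang] >= min_confidence else "und"
```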

Multi-device capture and device management

Overshow can capture from multiple audio inputs simultaneously, for example a headset and a room mic, or interface inputs on a podcast setup. Hot-swap support lets you plug and unplug devices during a session; the stack polls for reconnection so inputs return without a full app restart in typical cases.

Follow system default tracks the OS default input: when the default changes (Bluetooth disconnect, sleep/wake, plug/unplug), capture switches to the new default so you are not stuck on a stale device after an OS-level switch.

Practical multi-device setups

| Scenario | Suggestion |
| --- | --- |
| Video calls with music in-room | Capture meeting mic only; exclude unused inputs to reduce bleed |
| Dictation plus system sounds | Prefer headset input; avoid “loopback” unless you intend to index it |
| Interface + lavalier | Enable both inputs; name speakers after diarisation stabilises |

Hot-swap is designed for brief disconnects. If an interface powers down for minutes, expect a short gap until the poll rebinds the stream.
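The reconnection behaviour described above can be sketched as a simple poll loop. The interval, timeout, and function names here are illustrative, not Overshow's internals:

```python
import time

def poll_for_device(is_present, interval_s=1.0, timeout_s=10.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until a disconnected input reappears, then let the caller rebind.

    is_present: callable returning True once the OS reports the device again.
    Returns True if the device came back within timeout_s, else False.
    clock/sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_present():
            return True
        sleep(interval_s)
    return False
```

Polling rather than relying solely on OS notifications is one way to recover from brief disconnects without a full restart, at the cost of a short gap bounded by the poll interval.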

Audio processing pipeline

Incoming audio is processed through a defined chain before the speech model runs.

Audio pipeline stages

| Stage | What happens |
| --- | --- |
| Ingest | Audio from selected device(s), with hot-swap and default-follow behaviour |
| Normalisation | Level normalisation for consistent loudness |
| Decode / resample | Audio decoding and resampling for the speech model |
| VAD | Voice activity detection to focus effort on speech-bearing regions |
| Noise reduction | Noise reduction on detected speech segments |
| STT | On-device speech model decoding to text |
| Post-text | PII removal on transcript text before persistence |
| Optional diarisation | Speaker separation for speaker segments (see below) |
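The chain above can be sketched as a sequence of stage functions. Every stage body here is a placeholder standing in for Overshow's real processing; only the ordering mirrors the table:

```python
# Placeholder stages: each receives and returns a working dict.
def ingest(raw):          return {"samples": raw, "meta": {}}
def normalise(x):         x["meta"]["normalised"] = True; return x
def decode_resample(x):   x["meta"]["rate_hz"] = 16_000; return x  # illustrative rate
def vad(x):               x["speech_regions"] = [(0.0, 1.5)]; return x  # stand-in
def denoise(x):           x["meta"]["denoised"] = True; return x
def stt(x):               x["text"] = "hello world"; return x  # stand-in for the model
def scrub_pii(x):         x["text"] = x["text"].replace("secret", "[redacted]"); return x

PIPELINE = [ingest, normalise, decode_resample, vad, denoise, stt, scrub_pii]

def run(raw):
    out = raw
    for stage in PIPELINE:
        out = stage(out)
    return out
```

The point of the sketch is the fixed ordering: VAD runs before noise reduction so denoising is applied only to speech regions, and PII removal runs after decoding so only text is ever persisted.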

Noise reduction

Noise reduction estimates a noise floor and attenuates it on segments already flagged as speech by voice activity detection. It is not a substitute for a quiet room, but it steadies levels when fans, HVAC, or laptop hiss sit behind spoken content. Processing remains on-device; no audio leaves the machine for cloud denoising.
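A minimal sketch of the idea, using simple per-frame energy subtraction rather than Overshow's actual denoising algorithm:

```python
def reduce_noise(frame_energies, speech_flags, floor_scale=1.0):
    """Attenuate an estimated noise floor on speech frames.

    frame_energies: per-frame energy values.
    speech_flags: VAD output, True where a frame contains speech.
    The noise floor is estimated as the mean energy of non-speech frames,
    then subtracted (never below zero) from speech frames only.
    """
    noise = [e for e, s in zip(frame_energies, speech_flags) if not s]
    floor = sum(noise) / len(noise) if noise else 0.0
    return [max(e - floor_scale * floor, 0.0) if s else e
            for e, s in zip(frame_energies, speech_flags)]
```

Running denoising only on VAD-flagged frames is what the pipeline table describes: non-speech frames never feed the recogniser, so there is no benefit in cleaning them.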

Why resampling?

Speech models are optimised for specific audio parameters. Overshow resamples input automatically while keeping dimensions predictable for the runtime. Higher sample rates from hardware are downsampled after decode.
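Resampling can be sketched with linear interpolation. Real resamplers add low-pass filtering to avoid aliasing; the function below is an illustration, not Overshow's implementation:

```python
def resample(samples, src_hz, dst_hz):
    """Linearly interpolate samples from src_hz to dst_hz.

    Sketch only: a production resampler low-pass filters before
    downsampling so high frequencies do not alias into the result.
    """
    if src_hz == dst_hz:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_hz / src_hz))
    out = []
    for i in range(n_out):
        # Fractional position of output sample i in the input signal.
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```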

PII removal runs on transcript text before it is stored. Treat it as a safety net alongside pause/resume and organisational policy, not as sole protection for regulated content.
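The post-text scrub can be sketched as pattern-based redaction. The two patterns below are illustrative examples, not Overshow's actual rule set:

```python
import re

# Illustrative patterns only: email addresses and US-style SSNs.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[email]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[ssn]"),
]

def scrub(text: str) -> str:
    """Replace PII-like spans in transcript text before it is persisted."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Because this runs before persistence, the stored transcript never contains the matched spans; anything the patterns miss is why the docs call it a safety net rather than sole protection.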

Voice activity detection (VAD)

Two voice activity detection modes are available:

| Mode | Character |
| --- | --- |
| Neural | Higher accuracy for tricky backgrounds |
| Classical | Fast and CPU-friendly |

Voice activity detection prevents silence and noise from flooding the recogniser and pairs with noise reduction on speech segments.

VAD sensitivity levels

| Level | When to use |
| --- | --- |
| High | Noisy environments; avoid transcribing constant background hum as speech |
| Medium | Default balance for typical office and home |
| Low | Quiet rooms or very sparse speech where you do not want short utterances skipped |

Raise sensitivity when false speech triggers are common; lower it when short or quiet phrases are missing from transcripts.
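A classical energy-based check with the three sensitivity levels might look like this. The thresholds are invented for illustration; note that in this product a higher sensitivity level means a stricter speech gate, matching the table above:

```python
# Hypothetical per-level energy thresholds. "High" demands more energy
# before a frame counts as speech, so background hum is rejected;
# "Low" accepts quieter frames, so short soft utterances are kept.
THRESHOLDS = {"high": 0.30, "medium": 0.15, "low": 0.05}

def classify_frames(energies, sensitivity="medium"):
    """Flag each frame as speech when its energy clears the level threshold."""
    threshold = THRESHOLDS[sensitivity]
    return [e >= threshold for e in energies]
```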

Speaker diarisation and identification

Speaker diarisation (who spoke when) uses on-device models to cluster and separate speakers within captured audio. Naming and merging profiles, so search can filter by person, is covered on the identification page.

For labelling workflows, profile hygiene, and filters, see Speaker identification.

Diarisation quality rises with cleaner input, less overlap, and consistent device choice across sessions.
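The clustering step behind diarisation can be sketched as greedy assignment of per-segment voice embeddings. The cosine threshold and greedy approach are illustrative, not Overshow's actual models:

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def cluster_speakers(embeddings, threshold=0.8):
    """Assign each segment embedding to the first cluster whose centroid
    it resembles, else start a new speaker cluster. Returns one label
    per segment."""
    centroids, labels = [], []
    for emb in embeddings:
        for i, c in enumerate(centroids):
            if cosine(emb, c) >= threshold:
                labels.append(i)
                break
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels
```

The sketch also shows why cleaner input and consistent devices help: noisier embeddings drift across the similarity threshold, splitting one speaker into several clusters.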

What is stored

| Stored | Not stored |
| --- | --- |
| Transcript text per window/segment | WAV, MP3, or other raw audio archives |
| Timestamps and device metadata | Long-form recordings for replay |
| Speaker labels when diarisation/ID enabled | Audio blobs for external replay |

This transcript-only posture reduces disk use and aligns with privacy expectations for many organisations.
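The transcript-only posture implies a record shape along these lines. Field names are illustrative, not Overshow's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TranscriptSegment:
    """What persists per segment: text and metadata, never raw audio."""
    text: str            # transcript text for this window/segment
    start_s: float       # segment start time, seconds
    end_s: float         # segment end time, seconds
    device: str          # input device the audio came from
    speaker: Optional[str] = None  # set when diarisation/identification is on
```

Note what is absent: there is no field for sample data or an audio file path, which is the structural guarantee behind "transcript text and metadata only".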

Configuration in Settings → Recording

Audio options, including inputs, VAD choice and sensitivity, and related toggles, live under Settings → Recording alongside screen controls. Some changes share the relaunch rules documented for monitors and ignored windows on the screen capture page; device hot-swap covers many runtime changes without restart.

Supported configurations

| Area | Supported behaviour |
| --- | --- |
| Inputs | Multiple simultaneous devices; selection per device |
| Default device follow | Tracks OS default across plug/unplug, sleep/wake, Bluetooth |
| Hot-swap | Automatic reconnection polling |
| Stride / overlap | System defaults; not per-user sliders in core UI |
| VAD | Neural or classical; High / Medium / Low sensitivity |
| STT model | Bundled on-device speech model |
| Languages | 70+ with automatic detection |
| Speaker separation | On-device diarisation; labels via speaker identification |
| Compute | Uses local CPU/GPU per platform build; heavy sessions may increase fan noise (expected) |

Performance and resource use

Transcription is continuous while capture runs. Longer strides (within the product’s fixed defaults) reduce wake-ups but are not user-tunable in the core UI: Overshow balances latency, accuracy, and battery for typical laptops. If thermals spike during long calls, close unused GPU-heavy apps; the speech model competes for the same power budget as your meeting client.

Troubleshooting common audio issues

| Symptom | Things to check |
| --- | --- |
| No transcript | OS microphone permission; correct input selected; not paused; default-follow vs explicit device |
| Gaps at sentence joins | Normal boundary behaviour; overlap mitigates most cases, and extreme cross-boundary words are rare edge cases |
| Too much noise transcribed | Raise VAD sensitivity to High; reduce room noise; move closer to mic |
| Short words missing | Lower VAD sensitivity to Low; confirm normalisation is not clipping extremely quiet speech |
| Wrong microphone after dock/undock | Confirm Follow system default is enabled, or re-select the input in settings |
| Bluetooth dropouts | OS reconnect timing; hot-swap poll; wait briefly after reconnect |
| “Unknown” speakers | See Speaker identification; merge duplicates, name after clear utterances |
| Transcript language wrong | Rare auto-detect edge cases; dominant speaker language usually wins; heavy mixing may confuse detection |
| Repeated partial words at window edges | Overlap should absorb most cases; if persistent, note the timestamp and file feedback (may be unusually clipped device buffers) |
| Echo of your own voice | Use headphones; disable a secondary mic capturing room playback |
| USB hub power sag | Try a direct port on the laptop; unstable power resets interfaces mid-call |

When to pause instead of tuning

If a conversation must not be transcribed under policy, pause capture. No amount of VAD tuning substitutes for a deliberate pause during confidential segments.

Related documentation