Audio transcription
On-device speech-to-text, capture windows, multi-device inputs, voice activity detection, noise handling, diarisation, and transcript-only storage.
Last updated: 2 April 2026
What audio transcription does
Overshow captures speech from your chosen inputs and turns it into searchable transcript text on your machine. When you search or review a meeting, you are retrieving what was said as found in those transcripts, not text invented by the product. The privacy benefit is straightforward: the bundled speech model runs on-device, with no cloud transcription step for this core flow and no model download gate for the default engine.
Audio complements screen capture: you can find moments by spoken phrases, speaker labels (when identification is enabled), and time range alongside on-screen context.
Audio files are not persisted. Only transcript text and related metadata (timing, device, speaker labels where available) are stored. That matches the screen side’s “text and metadata only” posture.
The bundled speech model
Overshow ships a high-quality Whisper-based model with the application. That means:
- No separate download is required to start transcribing after install.
- No external API calls are needed for the default transcription path.
- Processing stays on your device, subject to platform capabilities.
Model updates ship with app updates. Your organisation controls upgrade cadence via its normal desktop software policy.
Capture windows
Transcription runs on fixed capture windows rather than open-ended streaming. Each window includes a small overlap with the next so words spanning a boundary are not cut off.
Why windows instead of arbitrary streaming?
Fixed windows stabilise resource usage, simplify alignment with storage and search indexes, and make it easier to apply downstream steps (voice activity detection, noise reduction, speaker separation) on consistent segments.
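The windowing scheme can be sketched in a few lines. The window and overlap lengths below are illustrative placeholders, since the product's actual defaults are internal:

```python
def windows(samples, window_s=30.0, overlap_s=2.0, rate=16000):
    """Split a sample buffer into fixed-length windows that share a
    small overlap, so a word straddling a boundary appears whole in
    at least one window. Sizes here are hypothetical defaults."""
    win = int(window_s * rate)                 # samples per window
    hop = int((window_s - overlap_s) * rate)   # stride between window starts
    out = []
    for start in range(0, len(samples), hop):
        chunk = samples[start:start + win]
        if chunk:
            out.append((start / rate, chunk))  # (start time in seconds, audio)
        if start + win >= len(samples):
            break
    return out
```

Because each window's tail is repeated at the head of the next, downstream stages can simply drop duplicate words at the seam instead of reconstructing split ones.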
Languages and automatic detection
The speech model supports 70+ languages with automatic language detection per segment or window. You do not need to pre-select a locale for many multilingual workflows; accuracy still improves when the dominant language matches your meetings.
Heavy code-switching, strong accents, or domain jargon may produce imperfect transcripts. Search supports partial phrase matches and, where enabled, semantic retrieval over embedded text, which helps when literal wording drifts.
Multi-device capture and device management
Overshow can capture from multiple audio inputs simultaneously, for example a headset and a room mic, or interface inputs on a podcast setup. Hot-swap support lets you plug and unplug devices during a session; the stack polls for reconnection so inputs return without a full app restart in typical cases.
Follow system default tracks the OS default input: when the default changes (Bluetooth disconnect, sleep/wake, plug/unplug), capture switches to the new default so you are not stuck on a stale device after an OS-level switch.
Practical multi-device setups
| Scenario | Suggestion |
|---|---|
| Video calls with music in-room | Capture meeting mic only; exclude unused inputs to reduce bleed |
| Dictation plus system sounds | Prefer headset input; avoid “loopback” unless you intend to index it |
| Interface + lavalier | Enable both inputs; name speakers after diarisation stabilises |
Hot-swap is designed for brief disconnects. If an interface powers down for minutes, expect a short gap until the poll rebinds the stream.
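The reconnection polling behind hot-swap can be sketched as a short retry loop. `enumerate_devices` and `open_stream` below stand in for the platform audio API; the names and poll interval are illustrative, not Overshow internals:

```python
import time

def rebind_loop(enumerate_devices, open_stream, wanted_id,
                poll_s=1.0, max_polls=5):
    """Poll for a disconnected device to reappear, then rebind its
    stream. If it never returns, give up so the caller can fall back
    to the OS default input."""
    for _ in range(max_polls):
        if wanted_id in enumerate_devices():
            return open_stream(wanted_id)   # device is back: rebind
        time.sleep(poll_s)                  # brief gap between polls
    return None                             # caller handles fallback
```

This is why a brief unplug heals silently while a multi-minute power-down shows a gap: transcription resumes only once a poll finds the device again.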
Audio processing pipeline
Incoming audio is processed through a defined chain before the speech model runs.
Audio pipeline stages
| Stage | What happens |
|---|---|
| Ingest | Audio from selected device(s), with hot-swap and default-follow behaviour |
| Normalisation | Level normalisation for consistent loudness |
| Decode / resample | Audio decoding and resampling for the speech model |
| VAD | Voice activity detection to focus effort on speech-bearing regions |
| Noise reduction | Noise reduction on detected speech segments |
| STT | On-device speech model decoding to text |
| Post-text | PII removal on transcript text before persistence |
| Optional diarisation | Speaker separation for speaker segments (see below) |
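The staged chain in the table can be sketched as a fold over stage callables, where any stage may drop a window (as VAD does for silence). The stage functions in the usage example are toy stand-ins, not Overshow internals:

```python
def transcribe_window(samples, stages):
    """Run one capture window through an ordered chain of stages.
    Each stage transforms its input; a stage returning None drops the
    window entirely (e.g. VAD finding no speech)."""
    for stage in stages:
        samples = stage(samples)
        if samples is None:
            return None     # nothing to persist for this window
    return samples          # final stage output (transcript text)
```

Toy usage, mirroring the normalise → VAD → STT portion of the table:

```python
stages = [
    lambda x: [s / 2 for s in x],                            # normalise (toy)
    lambda x: x if max(abs(s) for s in x) > 0.1 else None,   # VAD (toy)
    lambda x: "text",                                        # STT stub
]
```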
Noise reduction
Noise reduction estimates a noise floor and attenuates it on segments already flagged as speech by voice activity detection. It is not a substitute for a quiet room, but it steadies levels when fans, HVAC, or laptop hiss sit behind spoken content. Processing remains on-device; no audio leaves the machine for cloud denoising.
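A minimal time-domain sketch of "estimate a floor, attenuate below it" looks like the following; real denoisers work spectrally, and the threshold and attenuation factor here are purely illustrative:

```python
def reduce_noise(speech, noise_floor, atten=0.2):
    """Attenuate content at or below an estimated noise floor on a
    segment already flagged as speech by VAD. Samples clearly above
    the floor pass through unchanged."""
    return [s if abs(s) > noise_floor else s * atten for s in speech]
```

Because this runs only on VAD-flagged segments, steady hum between utterances never reaches it at all; it only cleans the residue sitting under actual speech.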
Why resampling?
Speech models are optimised for specific audio parameters. Overshow resamples input automatically while keeping dimensions predictable for the runtime. Higher sample rates from hardware are downsampled after decode.
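The downsampling step can be illustrated with a linear-interpolation resampler. Production code would use a proper polyphase filter, and the common 16 kHz speech-model rate is an assumption here, not a documented Overshow parameter:

```python
def resample(samples, src_rate, dst_rate):
    """Resample by linear interpolation: for each output sample, find
    its fractional position in the source and blend the two nearest
    source samples. A sketch only; real resamplers filter first."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # position in source samples
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```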
PII removal runs on transcript text before it is stored. Treat it as a safety net alongside pause/resume and organisational policy, not as sole protection for regulated content.
Voice activity detection (VAD)
Two voice activity detection modes are available:
| Mode | Character |
|---|---|
| Neural | Higher accuracy for tricky backgrounds |
| Classical | Fast and CPU-friendly |
Voice activity detection prevents silence and noise from flooding the recogniser and pairs with noise reduction on speech segments.
VAD sensitivity levels
| Level | When to use |
|---|---|
| High | Noisy environments; avoid transcribing constant background hum as speech |
| Medium | Default balance for typical office and home |
| Low | Quiet rooms or very sparse speech where you do not want short utterances skipped |
Raise sensitivity when false speech triggers are common; lower it when short or quiet phrases are missing from transcripts.
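The classical mode's behaviour can be sketched as an energy threshold per sensitivity level. The threshold values are hypothetical; the real levels are tuned internally:

```python
import math

# Illustrative thresholds only: High demands more energy to count a
# frame as speech, so background hum is rejected; Low accepts quieter
# frames, so short or faint utterances are kept.
THRESHOLDS = {"high": 0.05, "medium": 0.02, "low": 0.005}

def is_speech(frame, level="medium"):
    """Classical energy-based VAD sketch: a frame is speech if its
    RMS exceeds the threshold for the chosen sensitivity level."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > THRESHOLDS[level]
```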
Speaker diarisation and identification
Speaker diarisation (who spoke when) uses on-device models to cluster and separate speakers within captured audio. Naming and merging profiles, which let search filter by person, are covered on the identification page.
For labelling workflows, profile hygiene, and filters, see Speaker identification.
Diarisation quality rises with cleaner input, less overlap, and consistent device choice across sessions.
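Clustering of speaker segments can be illustrated with a greedy cosine-similarity pass over per-segment voice embeddings. This is a teaching sketch, far simpler than real diarisation, and the threshold is an arbitrary assumption:

```python
def cluster_speakers(embeddings, threshold=0.8):
    """Greedy clustering sketch: assign each segment embedding to the
    first existing speaker centroid it resembles (cosine similarity at
    or above the threshold), otherwise start a new speaker."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    centroids, labels = [], []
    for emb in embeddings:
        for i, c in enumerate(centroids):
            if cos(emb, c) >= threshold:
                labels.append(i)            # matches an existing speaker
                break
        else:
            centroids.append(emb)           # unseen voice: new speaker
            labels.append(len(centroids) - 1)
    return labels
```

The sketch also shows why clean input and consistent devices help: noisy or device-shifted embeddings drift away from their centroid and spawn spurious "new" speakers.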
What is stored
| Stored | Not stored |
|---|---|
| Transcript text per window/segment | WAV, MP3, or other raw audio archives |
| Timestamps and device metadata | Long-form recordings for replay |
| Speaker labels when diarisation/ID enabled | Audio blobs for external replay |
This transcript-only posture reduces disk use and aligns with privacy expectations for many organisations.
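The transcript-only record can be pictured as a small text-plus-metadata structure; the field names here are hypothetical, not Overshow's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TranscriptSegment:
    """Illustrative shape of what persists per segment: text and
    metadata only, never raw audio. All names are hypothetical."""
    text: str                     # what was said in this segment
    start_s: float                # segment start time, seconds
    end_s: float                  # segment end time, seconds
    device: str                   # input device the audio came from
    speaker: Optional[str] = None # set only when diarisation/ID is on
```

Note there is no field that could hold audio bytes: the "not stored" column is enforced by the shape of the record, not by a cleanup job.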
Configuration in Settings → Recording
Audio options, including inputs, VAD choice and sensitivity, and related toggles, live under Settings → Recording alongside screen controls. Some changes share the relaunch rules documented for monitors and ignored windows on the screen capture page; device hot-swap covers many runtime changes without a restart.
Supported configurations
| Area | Supported behaviour |
|---|---|
| Inputs | Multiple simultaneous devices; selection per device |
| Default device follow | Tracks OS default across plug/unplug, sleep/wake, Bluetooth |
| Hot-swap | Automatic reconnection polling |
| Stride / overlap | System defaults; not per-user sliders in core UI |
| VAD | Neural or classical; High / Medium / Low sensitivity |
| STT model | Bundled on-device speech model |
| Languages | 70+ with automatic detection |
| Speaker separation | On-device diarisation; labels via speaker identification |
| Compute | Uses local CPU/GPU per platform build; heavy sessions may increase fan noise, which is expected |
Performance and resource use
Transcription is continuous while capture runs. Longer strides (within the product’s fixed defaults) reduce wake-ups but are not user-tunable in the core UI: Overshow balances latency, accuracy, and battery for typical laptops. If thermals spike during long calls, close unused GPU-heavy apps; the speech model competes for the same power budget as your meeting client.
Troubleshooting common audio issues
| Symptom | Things to check |
|---|---|
| No transcript | OS microphone permission; correct input selected; not paused; default-follow vs explicit device |
| Gaps at sentence joins | Normal boundary behaviour; overlap mitigates most cases, and extreme cross-boundary words are rare edge cases |
| Too much noise transcribed | Raise VAD sensitivity to High; reduce room noise; move closer to mic |
| Short words missing | Lower VAD sensitivity to Low; confirm normalisation is not clipping extremely quiet speech |
| Wrong microphone after dock/undock | Follow system default enabled; or re-select input in settings |
| Bluetooth dropouts | OS reconnect timing and the hot-swap poll; wait briefly after reconnect |
| “Unknown” speakers | See Speaker identification: merge duplicates and name speakers after clear utterances |
| Transcript language wrong | Rare auto-detect edge cases; dominant speaker language usually wins; heavy mixing may confuse detection |
| Repeated partial words at window edges | Overlap should absorb most cases; if persistent, note the timestamp and file feedback, as it may indicate unusually clipped device buffers |
| Echo of your own voice | Use headphones; disable secondary mic capturing room playback |
| USB hub power sag | Try direct port on laptop; unstable power resets interfaces mid-call |
When to pause instead of tuning
If a conversation must not be transcribed under policy, pause capture. No amount of VAD tuning substitutes for a deliberate pause during confidential segments.
Related documentation
- Screen capture for how on-screen text aligns with transcripts in search.
- Speaker identification for profiles, merging, and filters.
- Privacy: on-device processing for how local processing fits your trust model.