Speaker identification

Automatic speaker separation groups speakers, supports naming and merging, and powers search filters across meetings and recordings.

Last updated: 2 April 2026

Purpose and scope

Speaker identification turns “who spoke when” in your audio into structured speaker profiles you can name, merge, and filter on. Overshow uses on-device speaker diarisation (segmentation plus speaker embedding) so repeated voices cluster into coherent speakers over time.

[Screenshot: speaker identification with colour-coded attribution]

On macOS, hardware acceleration supports parts of the pipeline including voice activity detection and speaker identification workloads, keeping processing responsive.

The outcome is not merely a transcript: you get searchable, filterable speaker labels that improve recall after calls, interviews, and long working sessions where several people contribute.

End-to-end pipeline

Stage | Components | Outcome
Capture | Microphone input from Overshow audio capture | Audio buffered for processing
VAD | Neural and classical voice activity detection | Speech versus non-speech regions estimated
Conditioning | Audio normalisation and noise reduction on speech segments (when diarisation enabled) | More stable segments for embedding
Diarisation | On-device segmentation and speaker embedding | “Speaker A / B / …” timelines within each file
Identity layer | Voice embeddings; clustering across segments | Profiles that persist and merge across sessions
Acceleration (macOS) | Hardware acceleration for applicable workloads | Lower latency and better battery behaviour on Apple silicon

Windows and macOS parity

The speaker identification pipeline is shared across desktop platforms; hardware acceleration is platform-specific. On Windows, the same stages run with the platform-appropriate runtime. Behaviour should feel equivalent, though throughput may differ by hardware.

How voice profiles work

Embeddings and clustering

Each speech segment is represented by a voice embedding: a numerical fingerprint of timbre and speaking style, not the words themselves. Similar embeddings are grouped so the system proposes distinct speakers within a recording and across sessions when the same person appears again.

Automatic grouping is probabilistic. Room acoustics, microphone quality, and overlapping speech all influence how cleanly clusters form. Overshow surfaces tools to rename, merge, and mark errors so your catalogue stays trustworthy.
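To make the grouping concrete, here is a minimal sketch of similarity-based clustering in Python, assuming each segment arrives as a fixed-length embedding vector; the greedy strategy and the 0.75 threshold are illustrative assumptions, not Overshow's actual algorithm.

```python
import numpy as np

def cluster_segments(embeddings: np.ndarray, threshold: float = 0.75) -> list[int]:
    """Greedily assign each segment to the nearest existing cluster by
    cosine similarity, or start a new one. Real diarisation clustering
    is typically more sophisticated; this only shows the core idea."""
    # Normalise rows so dot products equal cosine similarities.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids: list[np.ndarray] = []  # running (unnormalised) cluster sums
    labels: list[int] = []
    for emb in embeddings:
        sims = [float(emb @ c) / float(np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            best = int(np.argmax(sims))
            labels.append(best)
            centroids[best] += emb         # pull the centroid towards the segment
        else:
            labels.append(len(centroids))  # propose a new speaker cluster
            centroids.append(emb.copy())
    return labels
```

The threshold illustrates the trade-off described above: too strict and one person splits into several clusters; too loose and different people merge.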

Segmentation pipeline

Voice activity detection feeds diarisation. Overshow combines neural and classical voice activity detection to estimate where speech occurs before speaker models run. Automatic speaker segmentation then splits the timeline into speaker-attributed regions.

When speaker diarisation is enabled, the pipeline also applies audio normalisation and noise reduction on detected speech segments, which tends to stabilise embeddings and improve clustering under less-than-ideal capture conditions.
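As a rough illustration of how the two detector families can be fused, consider the sketch below; the frame size, noise floor, and simple agreement rule are assumptions for illustration, not the product's configuration.

```python
import numpy as np

FRAME = 512  # ~32 ms at 16 kHz; an assumed frame size

def energy_vad(frame: np.ndarray, floor_db: float = -45.0) -> bool:
    """Classical gate: frame RMS above an assumed noise floor."""
    rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
    return 20 * np.log10(rms) > floor_db

def fused_vad(frame: np.ndarray, neural_speech_prob: float) -> bool:
    """Require agreement: the cheap energy gate discards silence early,
    while the neural score (from a speech model, assumed available)
    rejects non-speech noise the gate would pass."""
    return energy_vad(frame) and neural_speech_prob > 0.5

# e.g. keep frames where both detectors agree:
# speech = [f for f, p in zip(frames, neural_probs) if fused_vad(f, p)]
```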

Why normalisation and noise reduction matter

Raw levels that swing between quiet laptop mics and loud desk setups can exaggerate superficial differences between clips of the same person. Normalisation and targeted noise reduction on speech regions help the embedding model focus on voice characteristics rather than volume quirks or steady background hum.
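As a sketch of both steps on a single speech segment (Python, with simplifying assumptions: float samples in [-1, 1], one-shot spectral subtraction rather than a framed processing chain):

```python
import numpy as np

def rms_normalise(speech: np.ndarray, target_db: float = -20.0) -> np.ndarray:
    """Scale a segment to a target RMS level so quiet and loud captures
    of the same voice present comparable energy to the embedding model."""
    rms = np.sqrt(np.mean(speech ** 2) + 1e-12)
    return speech * (10 ** (target_db / 20) / rms)

def spectral_gate(speech: np.ndarray, noise_sample: np.ndarray) -> np.ndarray:
    """Very rough noise reduction: subtract the magnitude spectrum of a
    non-speech noise sample, keep the original phase, and resynthesise."""
    spec = np.fft.rfft(speech)
    noise_mag = np.abs(np.fft.rfft(noise_sample, n=len(speech)))
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(speech))
```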

Naming and managing speakers

Assigning and changing names

You can assign names to automatically detected speakers when you recognise a voice, and rename profiles when labels drift or you standardise on display names (for example, after importing a calendar attendee list elsewhere in the product).

Merging duplicates

The same physical speaker may appear as multiple clusters across different microphones, rooms, or emotional states. Merging duplicates combines the profiles so search and filters treat them as one voice.

Similar speaker detection

Similar speaker detection uses embedding geometry to suggest profiles that might be the same person. Review suggestions before merging: close embeddings are a hint, not proof, especially for family members or similar-sounding colleagues.
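The geometry is easy to sketch. The snippet below assumes a hypothetical profile shape (a display name, the mean of its segment embeddings, and a segment count); the 0.8 threshold and the count-weighted merge are illustrative, not the product's implementation.

```python
import numpy as np

# Hypothetical profile shape: (name, centroid embedding, segment count).
Profile = tuple[str, np.ndarray, int]

def suggest_similar(profiles: list[Profile], threshold: float = 0.8):
    """Yield pairs of profiles whose centroids sit close in cosine
    similarity. Close embeddings are a hint, not proof: listen to a
    short sample from each before accepting a merge."""
    for i, (name_a, a, _) in enumerate(profiles):
        for name_b, b, _ in profiles[i + 1:]:
            sim = float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))
            if sim >= threshold:
                yield name_a, name_b, sim

def merge(a: Profile, b: Profile) -> Profile:
    """Keep one label and recompute the centroid as a count-weighted
    mean, so the larger profile dominates the merged voice print."""
    name, _, _ = a
    centroid = (a[2] * a[1] + b[2] * b[1]) / (a[2] + b[2])
    return name, centroid, a[2] + b[2]
```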

Clearer audio input improves clustering quality more than any post-processing tweak. A quiet room, a consistent mic position, and avoiding heavy compression where possible all help the model separate speakers cleanly.

Speaker management actions

Action | What it does | When to use it
Assign name | Attaches a human-readable label to an unnamed speaker cluster | After you recognise a voice in review
Rename | Updates the display label for an existing profile | Spelling fixes, preferred names, role-based labels
Merge duplicates | Unifies two or more profiles into one | Same person split across sessions or devices
Similar speaker review | Surfaces embedding-near profiles for manual confirmation | Housekeeping after many meetings
Hallucination marking | Flags false or spurious speaker detections | Cleaning up artefacts from noise or crosstalk
Unnamed speakers query | Lists speakers still needing labels | Periodic maintenance before reporting or handover

Handling false detections

Background noise, keyboard clatter, and low-bit-rate codecs can produce spurious speaker regions. Hallucination marking lets you clean up false detections without pretending the model was perfect.

Treat marking as curatorial: you are training your future self’s search experience, not grading the algorithm.

Aggressive merging without checking similar speaker suggestions can hide real participants. Prefer small, evidence-based merges after listening to short samples or checking meeting context.

Unnamed speakers and housekeeping

Use the unnamed speakers workflow to find clusters that still read as “Speaker 3” style placeholders. Labelling even a handful of recurring voices dramatically improves the scannability of long transcripts and post-meeting review.

Linking speakers to meetings

When recordings align with meeting metadata elsewhere in Overshow, speaker labels compound the value: you can move from calendar context to transcript to who said what without re-listening to entire calls.

Search and filters

Speaker filtering

The desktop search UI exposes speaker management in search filters. Restrict results to one or more named (or unnamed) speakers to review a single person’s contributions across days or projects.

How labels improve retrieval

Named speakers turn vague queries (“what did Alex say about the rollout?”) into filter-backed queries: text match plus speaker scope. Even partial naming (first names only, or role-based tags) beats scrolling unlabelled timelines.
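A toy illustration of that combination, with hypothetical (speaker, text) segment records:

```python
# Hypothetical transcript segments; real segments would carry timestamps too.
segments = [
    ("Alex", "We should delay the rollout until the metrics stabilise."),
    ("Sam", "The rollout plan looks fine to me."),
    ("Alex", "Let's revisit capacity before launch."),
]

def search(segments, keyword: str, speaker: str | None = None):
    """Text match plus speaker scope: restrict to one voice first, then
    match the keyword within that speaker's contributions."""
    return [
        (who, text) for who, text in segments
        if (speaker is None or who == speaker) and keyword.lower() in text.lower()
    ]

print(search(segments, "rollout", speaker="Alex"))
# [('Alex', 'We should delay the rollout until the metrics stabilise.')]
```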

Scenario | Benefit
Post-mortems | Isolate one owner’s statements quickly
Interviews | Separate interviewer and guest without manual timestamps
Stand-ups | Trace recurring updates from the same voice
Compliance review | Narrow to a single voice before exporting or citing
Onboarding listening | Find every utterance attributed to a new hire’s cluster

Combining speaker filters with text

Workflow | Suggestion
Exact quote hunt | Keyword mode plus speaker filter
Paraphrased idea | Semantic or hybrid mode plus speaker filter
Unknown wording | Start hybrid, then tighten speaker once a name surfaces
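For intuition on how a hybrid mode can blend both signals under one score, here is a sketch; the 50/50 weight and the single blended score are assumptions for illustration, not Overshow's ranking function.

```python
import numpy as np

def hybrid_score(query: str, query_vec: np.ndarray,
                 text: str, text_vec: np.ndarray,
                 alpha: float = 0.5) -> float:
    """Blend exact keyword evidence with semantic similarity: keyword
    match is binary, semantic similarity is cosine between embeddings
    (assumed to come from some text-embedding model)."""
    keyword = 1.0 if query.lower() in text.lower() else 0.0
    semantic = float(query_vec @ text_vec) / float(
        np.linalg.norm(query_vec) * np.linalg.norm(text_vec))
    return alpha * keyword + (1 - alpha) * semantic
```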

Configuration and pipeline interactions

Speaker identification sits downstream of capture and transcription but upstream of how you filter and search audio-derived content. Enabling diarisation engages the normalisation and noise reduction path on speech segments; disabling it skips that cost when you only need plain transcripts.
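Schematically, the gating looks like the sketch below; the stage functions are stubs with hypothetical names, not Overshow's API.

```python
def run_vad(audio):                  # stub: treat it all as one speech segment
    return [audio]

def normalise_and_denoise(segment):  # stub for the conditioning stage
    return segment

def transcribe(segments):            # stub transcription
    return ["..." for _ in segments]

def diarise(segments):               # stub speaker attribution
    return [f"Speaker {i}" for i, _ in enumerate(segments)]

def process_recording(audio, diarisation_enabled: bool):
    """Enabling diarisation pays the conditioning cost and yields
    speaker labels; disabling it returns plain transcripts only."""
    segments = run_vad(audio)
    if diarisation_enabled:
        segments = [normalise_and_denoise(s) for s in segments]
        return transcribe(segments), diarise(segments)
    return transcribe(segments), None
```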

Voice activity detection is part of the product’s default segmentation stack; you typically interact with outcomes through settings that enable or emphasise speaker features rather than low-level model toggles. Refer to your app version for exact controls.

If speaker counts look inflated in noisy environments, try improving capture quality before toggling advanced options. Fewer false speech segments mean fewer phantom speakers to merge or mark.

Best practices for voice quality

Practice | Effect on identification
Use a consistent primary microphone | Reduces embedding drift for the same person
Minimise overlapping speech | Overlap confuses segmentation boundaries
Reduce fan and keyboard noise at the source | Fewer false VAD triggers and hallucinated speakers
Avoid extreme dynamic range compression | Preserves natural spectral detail embeddings use
Position the mic close enough for clean speech | Weak signals blur speaker boundaries
Prefer wired or high-quality wireless with stable codec | Dropouts create fragmentary segments
Normalise meeting etiquette | One person speaking at a time helps diarisation
Close unused conferencing streams | Phantom channels inject low-level noise into VAD
Test levels before long recordings | Clipping and near-silence both harm embeddings
Prefer native app capture over brittle virtual cables | Stable routing reduces sudden timbre shifts

When quality is limited

Noisy cafes, open offices, and travel

Diarisation still runs, but expect more speaker splits and more unnamed clusters. Use hallucination marking liberally, merge only after listening, and accept that some sessions will remain “good enough for text search” rather than perfect speaker attribution. Pausing non-essential capture during the noisiest moments often saves more curation time than aggressive merging afterwards.

Room and hardware checklist

  1. Acoustics: soft furnishings reduce harsh reflections that colour embeddings differently across rooms.
  2. Gain: set input levels so normal speech peaks comfortably without clipping (see the level-check sketch after this list).
  3. Bluetooth: some headsets switch profiles for calls versus music; stick to one mode per session where possible.
  4. Laptop mics: workable for identification, but desk distance and fan noise are common reasons for extra speaker splits; an external mic is often the single biggest upgrade.
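The level-check sketch referenced in the gain step, assuming float samples in [-1, 1]; the dBFS thresholds are rules of thumb, not product defaults:

```python
import numpy as np

def level_check(samples: np.ndarray) -> str:
    """Report peak and RMS in dBFS for a short test clip and flag the
    two failure modes that harm embeddings: clipping and near-silence."""
    peak_db = 20 * np.log10(np.max(np.abs(samples)) + 1e-12)
    rms_db = 20 * np.log10(np.sqrt(np.mean(samples ** 2)) + 1e-12)
    if peak_db > -1.0:
        return f"peak {peak_db:.1f} dBFS: clipping risk, lower the gain"
    if rms_db < -40.0:
        return f"RMS {rms_db:.1f} dBFS: near-silent, raise gain or move closer"
    return f"peak {peak_db:.1f} dBFS, RMS {rms_db:.1f} dBFS: levels look healthy"
```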

Maintaining clean speaker profiles over time

  • Weekly or monthly: run through unnamed speakers and assign names for recurring voices.
  • After major hardware changes: expect new clusters; plan merges rather than fighting duplicate names.
  • After noisy recordings: use hallucination marking before merging, so you do not consolidate real speakers with junk segments.
  • Before handing off a project: rename speakers to names your team recognises so shared search stays intuitive.

Speaker identification runs on device alongside transcription. It is designed for organisations that want voice-derived structure without shipping raw audio to third-party diarisation APIs for routine work. Always align use with your local policy and consent practices.