Speaker identification

Automatic speaker separation groups speakers, supports naming and merging, and powers search filters across meetings and recordings.

Last updated: 2 April 2026

Purpose and scope

Speaker identification turns “who spoke when” in your audio into structured speaker profiles you can name, merge, and filter on. Overshow uses on-device speaker diarisation (segmentation plus speaker embedding) so repeated voices cluster into coherent speakers over time.

[Screenshot: speaker identification with colour-coded attribution]

On macOS, hardware acceleration supports parts of the pipeline including voice activity detection and speaker identification workloads, keeping processing responsive.

The outcome is not merely a transcript: you get searchable, filterable speaker labels that improve recall after calls, interviews, and long working sessions where several people contribute.

End-to-end pipeline

Stage | Components | Outcome
Capture | Microphone input from Overshow audio capture | Audio buffered for processing
VAD | Neural and classical voice activity detection | Speech versus non-speech regions estimated
Conditioning | Audio normalisation and noise reduction on speech segments (when diarisation enabled) | More stable segments for embedding
Diarisation | On-device segmentation and speaker embedding | “Speaker A / B / …” timelines within each file
Identity layer | Voice embeddings; clustering across segments | Profiles that persist and merge across sessions
Acceleration (macOS) | Hardware acceleration for applicable workloads | Lower latency and better battery behaviour on Apple silicon

Windows and macOS parity

The speaker identification pipeline is shared across desktop platforms; hardware acceleration is platform-specific. On Windows, the same stages run with the platform-appropriate runtime. Behaviour should feel equivalent, though throughput may differ by hardware.

How voice profiles work

Embeddings and clustering

Each speech segment is represented by a voice embedding: a numerical fingerprint of timbre and speaking style, not the words themselves. Similar embeddings are grouped so the system proposes distinct speakers within a recording and across sessions when the same person appears again.

Automatic grouping is probabilistic. Room acoustics, microphone quality, and overlapping speech all influence how cleanly clusters form. Overshow surfaces tools to rename, merge, and mark errors so your catalogue stays trustworthy.
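To make the grouping concrete, here is a minimal sketch of similarity-based clustering in Python, assuming each segment arrives as a fixed-length embedding vector; the greedy strategy and the 0.75 threshold are illustrative assumptions, not Overshow's actual algorithm.

```python
import numpy as np

def cluster_segments(embeddings: np.ndarray, threshold: float = 0.75) -> list[int]:
    """Greedily assign each segment to the nearest existing cluster by
    cosine similarity, or start a new one. Real diarisation clustering
    is typically more sophisticated; this only shows the core idea."""
    # Normalise rows so dot products equal cosine similarities.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids: list[np.ndarray] = []  # running (unnormalised) cluster sums
    labels: list[int] = []
    for emb in embeddings:
        sims = [float(emb @ c) / float(np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            best = int(np.argmax(sims))
            labels.append(best)
            centroids[best] += emb         # pull the centroid towards the segment
        else:
            labels.append(len(centroids))  # propose a new speaker cluster
            centroids.append(emb.copy())
    return labels
```

The threshold illustrates the trade-off described above: too strict and one person splits into several clusters; too loose and different people merge.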

Segmentation pipeline

Voice activity detection feeds diarisation. Overshow combines neural and classical voice activity detection to estimate where speech occurs before speaker models run. Automatic speaker segmentation then splits the timeline into speaker-attributed regions.

When speaker diarisation is enabled, the pipeline also applies audio normalisation and noise reduction on detected speech segments, which tends to stabilise embeddings and improve clustering under less-than-ideal capture conditions.
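As a rough illustration of how the two detector families can be fused, consider the sketch below; the frame size, noise floor, and simple agreement rule are assumptions for illustration, not the product's configuration.

```python
import numpy as np

FRAME = 512  # ~32 ms at 16 kHz; an assumed frame size

def energy_vad(frame: np.ndarray, floor_db: float = -45.0) -> bool:
    """Classical gate: frame RMS above an assumed noise floor."""
    rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
    return 20 * np.log10(rms) > floor_db

def fused_vad(frame: np.ndarray, neural_speech_prob: float) -> bool:
    """Require agreement: the cheap energy gate discards silence early,
    while the neural score (from a speech model, assumed available)
    rejects non-speech noise the gate would pass."""
    return energy_vad(frame) and neural_speech_prob > 0.5

# e.g. keep frames where both detectors agree:
# speech = [f for f, p in zip(frames, neural_probs) if fused_vad(f, p)]
```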

Why normalisation and noise reduction matter

Raw levels that swing between quiet laptop mics and loud desk setups can exaggerate superficial differences between clips of the same person. Normalisation and targeted noise reduction on speech regions help the embedding model focus on voice characteristics rather than volume quirks or steady background hum.
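As a sketch of both steps on a single speech segment (Python, with simplifying assumptions: float samples in [-1, 1], one-shot spectral subtraction rather than a framed processing chain):

```python
import numpy as np

def rms_normalise(speech: np.ndarray, target_db: float = -20.0) -> np.ndarray:
    """Scale a segment to a target RMS level so quiet and loud captures
    of the same voice present comparable energy to the embedding model."""
    rms = np.sqrt(np.mean(speech ** 2) + 1e-12)
    return speech * (10 ** (target_db / 20) / rms)

def spectral_gate(speech: np.ndarray, noise_sample: np.ndarray) -> np.ndarray:
    """Very rough noise reduction: subtract the magnitude spectrum of a
    non-speech noise sample, keep the original phase, and resynthesise."""
    spec = np.fft.rfft(speech)
    noise_mag = np.abs(np.fft.rfft(noise_sample, n=len(speech)))
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(speech))
```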

Naming and managing speakers

Assigning and changing names

You can assign names to automatically detected speakers when you recognise a voice, and rename profiles when labels drift or you standardise on display names (for example, after importing a calendar attendee list elsewhere in the product).

Merging duplicates

The same physical speaker may appear as multiple clusters across different microphones, rooms, or emotional states. Merging duplicates combines the profiles so search and filters treat them as one voice.

Similar speaker detection

Similar speaker detection uses embedding geometry to suggest profiles that might be the same person. Review suggestions before merging: close embeddings are a hint, not proof, especially for family members or similar-sounding colleagues.
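The geometry is easy to sketch. The snippet below assumes a hypothetical profile shape (a display name, the mean of its segment embeddings, and a segment count); the 0.8 threshold and the count-weighted merge are illustrative, not the product's implementation.

```python
import numpy as np

# Hypothetical profile shape: (name, centroid embedding, segment count).
Profile = tuple[str, np.ndarray, int]

def suggest_similar(profiles: list[Profile], threshold: float = 0.8):
    """Yield pairs of profiles whose centroids sit close in cosine
    similarity. Close embeddings are a hint, not proof: listen to a
    short sample from each before accepting a merge."""
    for i, (name_a, a, _) in enumerate(profiles):
        for name_b, b, _ in profiles[i + 1:]:
            sim = float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))
            if sim >= threshold:
                yield name_a, name_b, sim

def merge(a: Profile, b: Profile) -> Profile:
    """Keep one label and recompute the centroid as a count-weighted
    mean, so the larger profile dominates the merged voice print."""
    name, _, _ = a
    centroid = (a[2] * a[1] + b[2] * b[1]) / (a[2] + b[2])
    return name, centroid, a[2] + b[2]
```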

Clearer audio input improves clustering quality more than any post-processing tweak. A quiet room, a consistent mic position, and avoiding heavy compression where possible all help the model separate speakers cleanly.

Speaker management actions

Action | What it does | When to use it
Assign name | Attaches a human-readable label to an unnamed speaker cluster | After you recognise a voice in review
Rename | Updates the display label for an existing profile | Spelling fixes, preferred names, role-based labels
Merge duplicates | Unifies two or more profiles into one | Same person split across sessions or devices
Similar speaker review | Surfaces embedding-near profiles for manual confirmation | Housekeeping after many meetings
Hallucination marking | Flags false or spurious speaker detections | Cleaning up artefacts from noise or crosstalk
Unnamed speakers query | Lists speakers still needing labels | Periodic maintenance before reporting or handover

Handling false detections

Background noise, keyboard clatter, and low-bit-rate codecs can produce spurious speaker regions. Hallucination marking lets you clean up false detections without pretending the model was perfect.

Treat marking as curatorial: you are training your future self’s search experience, not grading the algorithm.

Aggressive merging without checking similar speaker suggestions can hide real participants. Prefer small, evidence-based merges after listening to short samples or checking meeting context.

Unnamed speakers and housekeeping

Use the unnamed speakers workflow to find clusters that still read as “Speaker 3” style placeholders. Labelling even a handful of recurring voices dramatically improves the scannability of long transcripts and post-meeting review.

Linking speakers to meetings

When recordings align with meeting metadata elsewhere in Overshow, speaker labels compound the value: you can move from calendar context to transcript to who said what without re-listening to entire calls.

Search and filters

Speaker filtering

The desktop search UI exposes speaker management in search filters. Restrict results to one or more named (or unnamed) speakers to review a single person’s contributions across days or projects.

How labels improve retrieval

Named speakers turn vague queries (“what did Alex say about the rollout?”) into filter-backed queries: text match plus speaker scope. Even partial naming (first names only, or role-based tags) beats scrolling unlabelled timelines.
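A toy illustration of that combination, with hypothetical (speaker, text) segment records:

```python
# Hypothetical transcript segments; real segments would carry timestamps too.
segments = [
    ("Alex", "We should delay the rollout until the metrics stabilise."),
    ("Sam", "The rollout plan looks fine to me."),
    ("Alex", "Let's revisit capacity before launch."),
]

def search(segments, keyword: str, speaker: str | None = None):
    """Text match plus speaker scope: restrict to one voice first, then
    match the keyword within that speaker's contributions."""
    return [
        (who, text) for who, text in segments
        if (speaker is None or who == speaker) and keyword.lower() in text.lower()
    ]

print(search(segments, "rollout", speaker="Alex"))
# [('Alex', 'We should delay the rollout until the metrics stabilise.')]
```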

Scenario | Benefit
Post-mortems | Isolate one owner’s statements quickly
Interviews | Separate interviewer and guest without manual timestamps
Stand-ups | Trace recurring updates from the same voice
Compliance review | Narrow to a single voice before exporting or citing
Onboarding listening | Find every utterance attributed to a new hire’s cluster

Combining speaker filters with text

Workflow | Suggestion
Exact quote hunt | Keyword mode plus speaker filter
Paraphrased idea | Semantic or hybrid mode plus speaker filter
Unknown wording | Start hybrid, then tighten speaker once a name surfaces
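For intuition on how a hybrid mode can blend both signals under one score, here is a sketch; the 50/50 weight and the single blended score are assumptions for illustration, not Overshow's ranking function.

```python
import numpy as np

def hybrid_score(query: str, query_vec: np.ndarray,
                 text: str, text_vec: np.ndarray,
                 alpha: float = 0.5) -> float:
    """Blend exact keyword evidence with semantic similarity: keyword
    match is binary, semantic similarity is cosine between embeddings
    (assumed to come from some text-embedding model)."""
    keyword = 1.0 if query.lower() in text.lower() else 0.0
    semantic = float(query_vec @ text_vec) / float(
        np.linalg.norm(query_vec) * np.linalg.norm(text_vec))
    return alpha * keyword + (1 - alpha) * semantic
```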

Configuration and pipeline interactions

Speaker identification sits downstream of capture and transcription but upstream of how you filter and search audio-derived content. Enabling diarisation engages the normalisation and noise reduction path on speech segments; disabling it skips that cost when you only need plain transcripts.
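Schematically, the gating looks like the sketch below; the stage functions are stubs with hypothetical names, not Overshow's API.

```python
def run_vad(audio):                  # stub: treat it all as one speech segment
    return [audio]

def normalise_and_denoise(segment):  # stub for the conditioning stage
    return segment

def transcribe(segments):            # stub transcription
    return ["..." for _ in segments]

def diarise(segments):               # stub speaker attribution
    return [f"Speaker {i}" for i, _ in enumerate(segments)]

def process_recording(audio, diarisation_enabled: bool):
    """Enabling diarisation pays the conditioning cost and yields
    speaker labels; disabling it returns plain transcripts only."""
    segments = run_vad(audio)
    if diarisation_enabled:
        segments = [normalise_and_denoise(s) for s in segments]
        return transcribe(segments), diarise(segments)
    return transcribe(segments), None
```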

Voice activity detection is part of the product’s default segmentation stack; you typically interact with outcomes through settings that enable or emphasise speaker features rather than low-level model toggles. Refer to your app version for exact controls.

If speaker counts look inflated in noisy environments, try improving capture quality before toggling advanced options. Fewer false speech segments mean fewer phantom speakers to merge or mark.

Best practices for voice quality

Practice | Effect on identification
Use a consistent primary microphone | Reduces embedding drift for the same person
Minimise overlapping speech | Overlap confuses segmentation boundaries
Reduce fan and keyboard noise at the source | Fewer false VAD triggers and hallucinated speakers
Avoid extreme dynamic range compression | Preserves natural spectral detail embeddings use
Position the mic close enough for clean speech | Weak signals blur speaker boundaries
Prefer wired or high-quality wireless with stable codec | Dropouts create fragmentary segments
Normalise meeting etiquette | One person speaking at a time helps diarisation
Close unused conferencing streams | Phantom channels inject low-level noise into VAD
Test levels before long recordings | Clipping and near-silence both harm embeddings
Prefer native app capture over brittle virtual cables | Stable routing reduces sudden timbre shifts

When quality is limited

Noisy cafes, open offices, and travel

Diarisation still runs, but expect more speaker splits and more unnamed clusters. Use hallucination marking liberally, merge only after listening, and accept that some sessions will remain “good enough for text search” rather than perfect speaker attribution. Pausing non-essential capture during the noisiest moments often saves more curation time than aggressive merging afterwards.

Room and hardware checklist

  1. Acoustics: soft furnishings reduce harsh reflections that colour embeddings differently across rooms.
  2. Gain: set input levels so normal speech peaks comfortably without clipping (see the level-check sketch after this list).
  3. Bluetooth: some headsets switch profiles for calls versus music; stick to one mode per session where possible.
  4. Laptop mics: workable for identification, but desk distance and fan noise are common reasons for extra speaker splits; an external mic is often the single biggest upgrade.
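The level-check sketch referenced in the gain step, assuming float samples in [-1, 1]; the dBFS thresholds are rules of thumb, not product defaults:

```python
import numpy as np

def level_check(samples: np.ndarray) -> str:
    """Report peak and RMS in dBFS for a short test clip and flag the
    two failure modes that harm embeddings: clipping and near-silence."""
    peak_db = 20 * np.log10(np.max(np.abs(samples)) + 1e-12)
    rms_db = 20 * np.log10(np.sqrt(np.mean(samples ** 2)) + 1e-12)
    if peak_db > -1.0:
        return f"peak {peak_db:.1f} dBFS: clipping risk, lower the gain"
    if rms_db < -40.0:
        return f"RMS {rms_db:.1f} dBFS: near-silent, raise gain or move closer"
    return f"peak {peak_db:.1f} dBFS, RMS {rms_db:.1f} dBFS: levels look healthy"
```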

Maintaining clean speaker profiles over time

  • Weekly or monthly: run through unnamed speakers and assign names for recurring voices.
  • After major hardware changes: expect new clusters; plan merges rather than fighting duplicate names.
  • After noisy recordings: use hallucination marking before merging, so you do not consolidate real speakers with junk segments.
  • Before handing off a project: rename speakers to names your team recognises so shared search stays intuitive.

Speaker identification runs on device alongside transcription. It is designed for organisations that want voice-derived structure without shipping raw audio to third-party diarisation APIs for routine work. Always align use with your local policy and consent practices.