Speaker identification
Automatic speaker separation creates speaker labels, supports profile linking and review, and powers search filters across meetings and recordings.
Last updated: 17 May 2026
Purpose and scope
Speaker identification turns “who spoke when” in your audio into structured speaker labels you can name, link to profiles, and filter on. Overshow uses on-device speaker diarisation, so each recording can be split into speaker-attributed segments without the audio leaving your device.
On macOS, hardware acceleration supports parts of the pipeline including voice activity detection and speaker identification workloads, keeping processing responsive.
The outcome is not merely a transcript: you get searchable, filterable speaker labels that improve recall after calls, interviews, and long working sessions where several people contribute.
End-to-end pipeline
| Stage | Components | Outcome |
|---|---|---|
| Capture | Microphone input from Overshow audio capture | Audio buffered for processing |
| VAD | Neural and classical voice activity detection | Speech versus non-speech regions estimated |
| Conditioning | Audio normalisation and noise reduction on speech segments (when diarisation is enabled) | More stable segments for embedding |
| Diarisation | On-device segmentation and speaker embedding | “Speaker A / B / …” timelines within each file |
| Identity layer | Reviewed profile links and optional voice-match suggestions | Safer cross-meeting identity without automatic merging |
| Acceleration (macOS) | Hardware acceleration for applicable workloads | Lower latency and better battery behaviour on Apple silicon |
Hardware acceleration on Apple silicon
On Macs with Apple silicon, hardware acceleration handles applicable workloads, such as voice activity detection and speaker identification, which reduces latency and improves battery behaviour during active diarisation.
How voice profiles work
Embeddings and clustering
Each speech segment can be represented by a voice embedding: a numerical fingerprint of timbre and speaking style, not the words themselves. Similar embeddings help the system propose distinct speakers within a recording. Reviewed profile links can seed local voice-match suggestions for later meetings when the calibrated suggestion surface is enabled.
Speaker grouping is probabilistic. Room acoustics, microphone quality, and overlapping speech all influence how cleanly labels form. Overshow surfaces tools to rename, link, and review close matches so your catalogue stays trustworthy.
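As a rough illustration of how similar embeddings can be grouped into speakers, the sketch below compares vectors with cosine similarity and assigns each segment to the closest existing cluster, or starts a new one. It is a minimal example, not Overshow's clustering method: the three-dimensional vectors, the `greedy_cluster` function, and the 0.8 threshold are all assumptions chosen for readability.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each embedding to the most similar cluster centroid, or open a new cluster.

    The threshold is a hypothetical value; a real system would calibrate it.
    """
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # Nothing is close enough: treat this segment as a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the matched cluster.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels

# Toy embeddings: segments 0, 1, and 3 point in a similar direction; segment 2 does not.
segment_embeddings = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [1.0, 0.0, 0.1]]
labels = greedy_cluster(segment_embeddings)
print([f"Speaker {chr(ord('A') + label)}" for label in labels])
```

Because the result depends on the threshold and on the order in which segments arrive, the grouping is best treated as a proposal to review rather than a final answer.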
Segmentation pipeline
Voice activity detection feeds diarisation. Overshow combines neural and classical voice activity detection to estimate where speech occurs before speaker models run. Automatic speaker segmentation then splits the timeline into speaker-attributed regions.
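To make the idea of a classical voice activity detector concrete, here is a minimal energy-threshold sketch. It only illustrates the general technique of marking frames whose level exceeds a threshold; it is not Overshow's detector, and the frame length, threshold, and function names are assumptions for the example.

```python
import math

def frame_rms(samples, frame_len):
    """Root-mean-square level of each full, non-overlapping frame."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return levels

def speech_regions(samples, frame_len=4, threshold=0.1):
    """Return (start_sample, end_sample) spans for runs of frames at or above the threshold."""
    regions = []
    run_start = None
    levels = frame_rms(samples, frame_len)
    for i, level in enumerate(levels):
        if level >= threshold and run_start is None:
            run_start = i * frame_len          # a speech run begins
        elif level < threshold and run_start is not None:
            regions.append((run_start, i * frame_len))  # the run ends
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(levels) * frame_len))
    return regions

# Toy signal: silence, a loud burst, a faint hum, then a quieter burst.
audio = [0.0] * 8 + [0.5, -0.5] * 4 + [0.01] * 4 + [0.3, -0.3] * 2
regions = speech_regions(audio)
print(regions)
```

A pure energy threshold also fires on loud non-speech sounds such as keyboard clatter, which is one reason to pair it with a neural detector.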
When speaker diarisation is enabled, the pipeline also applies audio normalisation and noise reduction on detected speech segments, which tends to stabilise embeddings and improve clustering under less-than-ideal capture conditions.
Why normalisation and noise reduction matter
Raw levels that swing between quiet laptop mics and loud desk setups can exaggerate superficial differences between clips of the same person. Normalisation and targeted noise reduction on speech regions help the embedding model focus on voice characteristics rather than volume quirks or steady background hum.
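The effect of level normalisation can be sketched with a simple RMS gain adjustment. This is an illustrative example under stated assumptions: the target level, the gain cap, and the function names are hypothetical, and it does not include noise reduction or peak limiting.

```python
import math

def rms(samples):
    """Root-mean-square level of a clip (0.0 for an empty clip)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalise_rms(samples, target_rms=0.1, max_gain=20.0):
    """Scale a clip towards a target RMS level, capping the gain so noise is not boosted without limit."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = min(target_rms / level, max_gain)
    return [s * gain for s in samples]

# The same waveform captured quietly (laptop mic) and 40 times louder (desk setup).
quiet = [0.01, -0.02, 0.015, -0.01]
loud = [0.4, -0.8, 0.6, -0.4]

quiet_norm = normalise_rms(quiet)
loud_norm = normalise_rms(loud)
print(round(rms(quiet_norm), 3), round(rms(loud_norm), 3))
```

After normalisation the two clips sit at the same level, so whatever differences remain come from the voice and the room rather than from input gain.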
Naming and managing speakers
Assigning and changing names
You can assign names to automatically detected speakers when you recognise a voice, and link those speakers to people profiles when you want that identity to carry into search and future review surfaces.
Merging duplicates
The same person may appear under multiple labels when the microphone, room, or emotional tone changes between recordings. Reviewed profile links and similar-speaker suggestions help consolidate those labels without silently merging unrelated people.
Similar speaker detection
Similar speaker detection uses embedding geometry to suggest profiles that might be the same person. Review suggestions before linking. Close embeddings are a hint, not proof, especially for family voices or similar-sounding colleagues.
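A suggestion list of this kind can be sketched as a ranking of known profiles by cosine similarity to an unlinked speaker. The profile names, vectors, similarity floor, and the `suggest_similar` function below are all hypothetical; the point is that the function returns candidates for review and never links anything itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def suggest_similar(candidate, profiles, min_similarity=0.85, limit=3):
    """Rank profiles by similarity to an unlinked speaker.

    Returns (name, similarity) pairs as suggestions only; linking stays a manual step.
    """
    scored = [(name, cosine_similarity(candidate, emb)) for name, emb in profiles.items()]
    scored = [(name, sim) for name, sim in scored if sim >= min_similarity]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

# Toy profile embeddings (assumed values).
profiles = {
    "Alex": [0.9, 0.1, 0.2],
    "Sam": [0.1, 0.9, 0.3],
    "Alex (headset)": [0.7, 0.3, 0.4],
}
unlinked_speaker = [0.85, 0.15, 0.25]

suggestions = suggest_similar(unlinked_speaker, profiles)
for name, similarity in suggestions:
    print(f"{name}: {similarity:.3f}")
```

Keeping the similarity score visible next to each suggestion makes it easier to judge borderline cases, such as relatives with similar voices.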
Clearer audio input improves clustering quality more than any post-processing tweak. A quiet room, a consistent mic position, and avoiding heavy compression where possible all help the model separate speakers cleanly.
Speaker management actions
| Action | What it does | When to use it |
|---|---|---|
| Assign name | Attaches a human-readable label to an unnamed speaker cluster | After you recognise a voice in review |
| Rename | Updates the display label for an existing profile | Spelling fixes, preferred names, role-based labels |
| Link to profile | Connects a speaker label to a person profile | Recurring collaborators and meeting attendees |
| Similar speaker review | Surfaces embedding-near profiles for manual confirmation | Housekeeping after many meetings |
| Hallucination marking | Flags false or spurious speaker detections | Cleaning up artefacts from noise or crosstalk |
| Unnamed speakers query | Lists speakers still needing labels | Periodic maintenance before reporting or handover |
Handling false detections
Background noise, keyboard clatter, and low-bit-rate codecs can produce spurious speaker regions. Hallucination marking lets you clean up false detections without pretending the model was perfect.
Treat marking as curatorial: you are training your future self’s search experience, not grading the algorithm.
Aggressive consolidation without checking similar speaker suggestions can hide real participants. Prefer small, evidence-based profile links after listening to short samples or checking meeting context.
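One way to picture the effect of hallucination marking is as a flag that removes a segment from later review. The sketch below totals speaking time per label while skipping flagged segments; the segment fields (`speaker`, `start`, `end`, `hallucination`) are an assumed shape for illustration, not Overshow's data model.

```python
def speaking_time(segments):
    """Total seconds per speaker label, ignoring segments marked as hallucinations."""
    totals = {}
    for seg in segments:
        if seg.get("hallucination", False):
            continue  # marked as a false detection: leave it out of review
        duration = seg["end"] - seg["start"]
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + duration
    return totals

# Toy timeline: "Speaker 4" only appears in a segment flagged as keyboard noise.
timeline = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.0},
    {"speaker": "Speaker 4", "start": 4.0, "end": 4.5, "hallucination": True},
    {"speaker": "Speaker 2", "start": 4.5, "end": 9.0},
    {"speaker": "Speaker 1", "start": 9.0, "end": 15.0},
]

totals = speaking_time(timeline)
print(totals)
```

With the flagged segment excluded, the phantom label drops out of the summary, so it cannot be linked to a real profile by mistake.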
Unnamed speakers and housekeeping
Use the unnamed speakers workflow to find clusters that still carry “Speaker 3”-style placeholders. Labelling even a handful of recurring voices dramatically improves the scannability of long transcripts and post-meeting review.
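Conceptually, an unnamed speakers query is a filter for labels that still match the placeholder pattern. The sketch below assumes placeholders look like “Speaker 3” or “Speaker B” and that segments are simple dictionaries; both are assumptions for illustration, not Overshow's actual schema.

```python
import re
from collections import Counter

# Assumed placeholder formats: "Speaker 3" or "Speaker B".
PLACEHOLDER = re.compile(r"^Speaker (\d+|[A-Z])$")

def unnamed_speakers(segments):
    """Count segments per placeholder label, most frequent first."""
    counts = Counter(
        seg["speaker"] for seg in segments if PLACEHOLDER.match(seg["speaker"])
    )
    return counts.most_common()

transcript = [
    {"speaker": "Alex", "text": "Shall we start?"},
    {"speaker": "Speaker 3", "text": "Yes, go ahead."},
    {"speaker": "Speaker 3", "text": "I have the numbers."},
    {"speaker": "Speaker B", "text": "Sorry, I joined late."},
]

todo = unnamed_speakers(transcript)
print(todo)
```

Sorting by frequency puts the most talkative unnamed voices first, which is usually where naming pays off soonest.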
Linking speakers to meetings
When recordings align with meeting metadata elsewhere in Overshow, speaker labels compound the value: you can move from calendar context to transcript to who said what without re-listening to entire calls.
Search and filters
Speaker filtering
The desktop search UI includes speaker management among its search filters. Restrict results to one or more named (or unnamed) speakers to review a single person’s contributions across days or projects.
How labels improve retrieval
Named speakers turn vague queries (“what did Alex say about the rollout”) into filter-backed queries: text match plus speaker scope. Even partial naming, such as a first name or role-based tag, beats scrolling unlabelled timelines.
| Scenario | Benefit |
|---|---|
| Post-mortems | Isolate one owner’s statements quickly |
| Interviews | Separate interviewer and guest without manual timestamps |
| Stand-ups | Trace recurring updates from the same voice |
| Compliance review | Narrow to a single voice before exporting or citing |
| Onboarding listening | Find every utterance attributed to a new hire’s cluster |
Combining speaker filters with text
| Workflow | Suggestion |
|---|---|
| Exact quote hunt | Keyword mode plus speaker filter |
| Paraphrased idea | Semantic or hybrid mode plus speaker filter |
| Unknown wording | Start hybrid, then tighten speaker once a name surfaces |
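The “keyword mode plus speaker filter” combination can be sketched as a text match with an optional speaker scope. This example covers keyword matching only (semantic and hybrid modes are out of scope here), and the `Segment` type and `search` function are hypothetical names, not part of Overshow.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    text: str

def search(segments, query, speakers=None):
    """Case-insensitive keyword match, optionally restricted to a set of speaker labels."""
    needle = query.lower()
    allowed = set(speakers) if speakers else None
    return [
        seg for seg in segments
        if needle in seg.text.lower() and (allowed is None or seg.speaker in allowed)
    ]

transcript = [
    Segment("Alex", "The rollout moves to Thursday."),
    Segment("Sam", "Is the rollout blocked on QA?"),
    Segment("Alex", "QA signed off this morning."),
    Segment("Speaker 3", "Rollout notes are in the doc."),
]

# "What did Alex say about the rollout": text match plus speaker scope.
alex_hits = search(transcript, "rollout", speakers=["Alex"])
for hit in alex_hits:
    print(f"{hit.speaker}: {hit.text}")
```

Passing no speakers falls back to a plain keyword search, which mirrors the “start broad, then tighten speaker” workflow in the table above.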
Configuration and pipeline interactions
Speaker identification sits downstream of capture and transcription but upstream of how you filter and search audio-derived content. Enabling diarisation engages the normalisation and noise reduction path on speech segments; disabling it skips that cost when you only need plain transcripts.
Voice activity detection is part of the product’s default segmentation stack; you typically interact with outcomes through settings that enable or emphasise speaker features rather than low-level model toggles. The exact controls depend on your app version.
If speaker counts look inflated in noisy environments, try improving capture quality before toggling advanced options. Fewer false speech segments mean fewer phantom speakers to link, review, or mark.
Best practices for voice quality
| Practice | Effect on identification |
|---|---|
| Use a consistent primary microphone | Reduces embedding drift for the same person |
| Minimise overlapping speech | Overlap confuses segmentation boundaries |
| Reduce fan and keyboard noise at the source | Fewer false VAD triggers and hallucinated speakers |
| Avoid extreme dynamic range compression | Preserves natural spectral detail embeddings use |
| Position the mic close enough for clean speech | Weak signals blur speaker boundaries |
| Prefer wired or high-quality wireless with stable codec | Dropouts create fragmentary segments |
| Encourage turn-taking in meetings | One person speaking at a time helps diarisation |
| Close unused conferencing streams | Phantom channels inject low-level noise into VAD |
| Test levels before long recordings | Clipping and near-silence both harm embeddings |
| Prefer native app capture over brittle virtual cables | Stable routing reduces sudden timbre shifts |
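A pre-recording level test, as suggested in the table, can be approximated with a short check for clipping and near-silence. The sketch assumes samples are floats in the range -1.0 to 1.0; the thresholds and the `check_levels` function are illustrative assumptions rather than recommended values.

```python
def check_levels(samples, clip_level=0.99, silence_peak=0.01, max_clipped_fraction=0.001):
    """Classify a test recording as "empty", "clipping", "too quiet", or "ok".

    Assumes float samples in the range -1.0 to 1.0; thresholds are hypothetical.
    """
    if not samples:
        return "empty"
    peak = max(abs(s) for s in samples)
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    if clipped / len(samples) > max_clipped_fraction:
        return "clipping"      # too many samples pinned at full scale
    if peak < silence_peak:
        return "too quiet"     # even the loudest sample is barely above silence
    return "ok"

print(check_levels([0.2, -0.3, 0.25, -0.1]))   # comfortable speech level
print(check_levels([1.0, -1.0, 0.5, 1.0]))     # input gain set too high
print(check_levels([0.001, -0.002, 0.001]))    # mic too far away or muted
```

Running a check like this on a few seconds of normal speech before a long session is cheaper than discovering unusable embeddings afterwards.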
When quality is limited
Noisy cafes, open offices, and travel
Diarisation still runs, but expect more speaker splits and more unnamed clusters. Use hallucination marking liberally, link only after listening, and accept that some sessions will remain “good enough for text search” rather than perfect speaker attribution. Pausing non-essential capture during the noisiest moments often saves more curation time than aggressive consolidation afterwards.
Room and hardware checklist
- Acoustics: soft furnishings reduce harsh reflections that colour embeddings differently across rooms.
- Gain: set input levels so normal speech peaks comfortably without clipping.
- Bluetooth: some headsets switch profiles for calls versus music; stick to one mode per session where possible.
- Laptop mics: workable for identification, but desk distance and fan noise are common reasons for extra speaker splits. An external mic is often the single biggest upgrade.
Maintaining clean speaker profiles over time
- Weekly or monthly: run through unnamed speakers and assign names for recurring voices.
- After major hardware changes: expect new clusters; plan review time rather than fighting duplicate names.
- After noisy recordings: use hallucination marking before linking labels, so you do not consolidate real speakers with junk segments.
- Before handing off a project: rename speakers to names your team recognises so shared search stays intuitive.
Speaker identification runs on device alongside transcription. It is designed for organisations that want voice-derived structure without shipping raw audio to third-party diarisation APIs for routine work. Always align use with your local policy and consent practices.