
Document indexing

Watch local folders, extract and chunk text from many file types, and search documents alongside screen and audio captures with keyword and semantic retrieval.

Last updated: 2 April 2026

Purpose and scope

Document indexing brings files you already keep on disk into the same local search surface as screen OCR and audio transcripts. Instead of treating captures and documents as separate silos, you can query once and see relevant passages from PDFs, spreadsheets, notebooks, and source code next to moments from your display and microphone.

This is useful when your authoritative text lives in a project tree, a readings folder, or shared documentation that you are allowed to index locally. Overshow watches paths you choose, extracts text where supported, chunks it for retrieval, and keeps the index aligned as files appear, change, or disappear.

Document indexing is gated by a feature flag in shipping builds. If you do not see directory watching or the Documents settings tab, your build or organisation policy may have this capability disabled.

Supported file formats

The indexer recognises a broad set of formats. For plain-text extensions, a binary guard and UTF-8 validation reject binary or mislabelled payloads, so corrupted files do not pollute the index.
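The guard's exact heuristics are not published; a minimal sketch of the idea, assuming a NUL-byte scan plus a strict UTF-8 decode, might look like this:

```python
def looks_like_text(payload: bytes) -> bool:
    """Illustrative binary guard: reject payloads that contain NUL bytes
    or are not valid UTF-8. (Real text files almost never contain NULs;
    most binary formats do.)"""
    if b"\x00" in payload:
        return False
    try:
        payload.decode("utf-8")  # must decode cleanly to index as plain text
    except UnicodeDecodeError:
        return False
    return True
```

A file named `notes.txt` that fails this check would be skipped rather than indexed as readable text.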

Documents and data

| Category | Formats | Extraction approach |
| --- | --- | --- |
| PDF | .pdf | On-device text extraction |
| Word | .docx | Structured document text extraction |
| Spreadsheets | Excel family, ODS | Tabular content extraction |

Code and plain text

| Category | Extensions |
| --- | --- |
| Documentation and config | .txt, .md, .yaml, .yml, .json, .toml, .xml |
| Systems and application code | .rs, .py, .js, .ts, .go, .java, .c, .cpp, .h, .cs, .rb, .php, .swift, .kt, .scala, .sh |
Why so many text extensions?

Knowledge work rarely lives in a single format. Treating source, configuration, and prose under one watcher lets you search “how we configured retries” or “that enum variant name” without remembering whether it lived in Rust, YAML, or a README. The binary guard still applies: a file named .txt that is mostly binary will not be indexed as readable text.

How indexing works

Indexing is a pipeline from disk to searchable chunks. The steps below run on your machine; no document content is sent to external search or embedding APIs for this feature.

Step-by-step pipeline

| Step | What happens |
| --- | --- |
| 1. Discovery | The watcher observes configured directories and detects new, modified, or removed files matching supported types. |
| 2. Validation | Plain-text paths are checked for binary content and valid UTF-8 before parsing. |
| 3. Extraction | Format-specific extractors pull text from PDF, DOCX, and spreadsheets, or read plain files directly. |
| 4. Chunking | Text is split into paragraph-aware chunks sized for retrieval, with overlap between neighbouring chunks so passages near a boundary are still embedded with surrounding context. |
| 5. Indexing | Chunks feed keyword search and a semantic index built from on-device embeddings. Full-text entries use redacted, tokenised representations rather than storing raw sensitive plaintext in the search surface. |
| 6. Encryption | Chunk payloads are encrypted at rest in line with Overshow's local storage model. |
| 7. Updates | When a file changes, the system applies transactional document updates: chunk rows, FTS rows, and background queue work are replaced atomically, so you never see half-updated documents in search results. |
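The chunking step can be sketched as follows. Overshow's actual chunk sizes and overlap are not published, so `max_chars` and `overlap` here are illustrative assumptions, and a real implementation would also split paragraphs that exceed the limit on their own:

```python
def chunk_text(text: str, max_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Illustrative paragraph-aware chunker: pack whole paragraphs up to
    max_chars, then start the next chunk with a tail of the previous one
    so content near a boundary keeps surrounding context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry overlap into the next chunk
        current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk after the first begins with the last `overlap` characters of its predecessor, which is what lets embeddings computed per chunk still "see" across boundaries.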
Stale chunks and background work

If a chunk is removed after a job was enqueued but before processing completes, the pipeline handles stale references gracefully. Work items that no longer point at live chunks are dropped safely instead of failing the whole queue.
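A queue worker that tolerates stale references might look like the sketch below; the job and chunk shapes are assumptions for illustration, not Overshow's internal types:

```python
def process_queue(jobs, live_chunks, embed):
    """Illustrative embedding worker: jobs whose chunk was deleted after
    enqueue are dropped rather than failing the whole queue."""
    completed, dropped = [], []
    for chunk_id in jobs:
        text = live_chunks.get(chunk_id)
        if text is None:  # chunk removed since the job was enqueued: stale
            dropped.append(chunk_id)
            continue
        completed.append((chunk_id, embed(text)))
    return completed, dropped
```

The key property is that a stale job is a no-op, not an error, so one deleted file cannot stall embedding work for everything behind it in the queue.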

Keyword versus semantic search on documents

| Mode | Role for documents |
| --- | --- |
| Keyword | Fast matching on terms, names, and exact phrases surfaced via the tokenised index. |
| Semantic (embeddings) | Finds chunks whose meaning matches your query when wording differs from the source. |
| Hybrid | Combines both signals so document hits rank alongside screen and audio results in a single list. |
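One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion; whether Overshow uses RRF specifically is not stated, so treat this as an illustrative sketch of hybrid fusion in general:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Illustrative reciprocal rank fusion: each list contributes
    1 / (k + rank) per result, and items are sorted by total score.
    k = 60 is the conventional default from the RRF literature."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, the two modes' raw scores never need to be on a comparable scale, which is why fusion of this style works across keyword, semantic, and capture-derived result lists.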

Embedding generation and background work

After chunks are written, embedding jobs are enqueued for semantic indexing. Generation uses the same on-device embedding model as other semantic features in Overshow. Large directories therefore produce bursts of background CPU or GPU work; the app is designed to process the queue without blocking the search UI, but you may notice activity during initial indexing or after a full rescan.

Full-text index representation

The full-text layer supports fast keyword retrieval, but it does not mirror your files verbatim in the index store. Redacted, tokenised forms reduce exposure of raw sensitive strings in the full-text structure while still allowing stemmed and phrase-style matching. Your chunk payloads remain protected separately through encryption at rest, consistent with the rest of the desktop datastore.
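The exact redaction rules are not documented. As a purely hypothetical illustration, a tokeniser might mask sensitive-looking strings with placeholders before emitting terms to the full-text index; the e-mail and digit-run patterns below are assumptions, not Overshow's actual rules:

```python
import re

# Hypothetical redaction patterns: mask before tokenising for FTS.
SENSITIVE = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),  # e-mail addresses
    (re.compile(r"\b\d{6,}\b"), "<number>"),              # long digit runs
]

def fts_tokens(text: str) -> list[str]:
    """Illustrative redacted tokenisation: substitute placeholders for
    sensitive-looking spans, then emit lowercase word tokens."""
    for pattern, placeholder in SENSITIVE:
        text = pattern.sub(placeholder, text)
    return re.findall(r"<\w+>|\w+", text.lower())
```

The point is that keyword search still matches surrounding words and phrases, while the raw sensitive string never enters the full-text structure verbatim.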

Documents compared with captures

| Source | Typical content | Index shape |
| --- | --- | --- |
| Screen OCR | Ephemeral UI text, stack traces, chat | Time-ordered capture segments |
| Audio | Spoken words via local STT | Transcript segments with timing |
| Documents | Authoritative files on disk | Versioned chunks per file path |

Unified search does not flatten these into one blob; it ranks across parallel indexes so the right modality can win per query.

Adding and managing watched directories

Onboarding

During the onboarding wizard, step 4 offers a native directory picker so you can grant one or more roots without typing paths by hand. You can skip this step and add folders later from settings if you prefer to explore the app first.

Settings

Open Settings, then the Documents tab to:

  • Add or remove watched directories
  • Review which roots are active
  • Understand scope before indexing large trees

Prefer stable project or library roots over entire home directories unless you genuinely need that breadth. Narrower watches reduce background work and make it easier to reason about what is searchable.

Rescan on demand

Each watched directory can be rescanned on demand. Use rescan after bulk copies, git checkouts, or network drive syncs where file system events might have been missed. Rescan reconciles the index with the current filesystem state under that path.
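Reconciliation during a rescan amounts to diffing the index against the current filesystem. A minimal sketch, assuming state is tracked as path-to-mtime maps (an assumption about Overshow's internals):

```python
def plan_rescan(indexed: dict[str, float], on_disk: dict[str, float]):
    """Illustrative rescan planner: compare indexed state (path -> mtime)
    with the filesystem and return paths to add, re-index, and remove."""
    to_add    = sorted(p for p in on_disk if p not in indexed)
    to_update = sorted(p for p in on_disk
                       if p in indexed and on_disk[p] != indexed[p])
    to_remove = sorted(p for p in indexed if p not in on_disk)
    return to_add, to_update, to_remove
```

This is why rescan recovers from missed watcher events: it does not replay events, it compares end states.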

Search experience: documents alongside captures

Visual distinction

In the desktop search UI, document results use a blue FileText-style icon and a document badge so you can distinguish them at a glance from screen captures and audio-derived hits. This keeps mixed result lists scannable without opening every item.

Content type filtering

Use content type filters (where available) to narrow to documents only, or to combine documents with screen and audio as needed. The same query text runs across whichever types you include.

Unified retrieval

Unified search means one query can return:

  • OCR text from screen history
  • Transcript segments from audio
  • Passages from indexed files under watched directories

Ranking and hybrid fusion apply across these sources so the most relevant chunk wins regardless of origin.

When files change or are deleted

| Event | Typical behaviour |
| --- | --- |
| File edited | Chunks for that document are replaced atomically; FTS and embedding queues stay consistent. |
| File deleted | Associated chunks are removed; future searches no longer surface that file. |
| Temporary inconsistency | Transactional updates and stale-chunk handling avoid leaving orphan hits in the UI. |
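The atomic-replacement behaviour can be illustrated with a SQLite transaction; the `chunks` table and its columns here are hypothetical, not Overshow's actual schema:

```python
import sqlite3

def replace_document(conn: sqlite3.Connection, path: str,
                     chunks: list[str]) -> None:
    """Illustrative transactional update: delete a document's old chunk
    rows and insert the new ones inside a single transaction, so readers
    never observe a half-updated document."""
    with conn:  # commits on success, rolls back on any exception
        conn.execute("DELETE FROM chunks WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO chunks (path, seq, body) VALUES (?, ?, ?)",
            [(path, i, body) for i, body in enumerate(chunks)],
        )
```

If the insert fails midway, the delete is rolled back too, so the previous version of the document stays searchable intact.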

Removing a directory from the watch list stops new indexing for that path. Whether previously indexed material is purged immediately depends on product behaviour in your version; if you need certainty, consult release notes or support for retention semantics.

Security and privacy notes

  • At rest: Document chunk content is stored encrypted at rest with the same local encryption posture as other sensitive indexed material.
  • FTS representation: Keyword search uses redacted, tokenised index entries, not a verbatim dump of sensitive plaintext into the full-text store.
  • On device: Extraction, chunking, embeddings, and search run locally; you are not shipping file contents to a cloud document index for this feature.

Watching a folder is an explicit decision to make its supported files searchable on this device. Material under legal hold, client confidentiality regimes, or export-controlled paths should stay outside watched lists unless your organisation’s policy and contractual licence terms permit local indexing.

Tips for organising watched directories

| Practice | Rationale |
| --- | --- |
| One logical root per project | Easier rescan, clearer mental model of what is indexed. |
| Exclude archives you rarely search | Reduces noise and background embedding work. |
| Separate "reference" and "active work" | You can rescan reference libraries less often than fast-moving repos. |
| Align with policy | Only watch folders your organisation permits to sit in a local searchable index. |
| Rescan after large imports | Ensures batch operations are fully reflected when events were batched or delayed. |

Pairing with screen and audio search

When a decision was both discussed in a meeting (audio) and written in a spec (document), hybrid search across all content types often surfaces both. Filter to documents when you know the answer is in a file; widen to all types when you are unsure.

Operational expectations

| Situation | What to expect |
| --- | --- |
| First watch on a large tree | A longer initial indexing phase; embeddings fill in progressively. |
| Frequent saves in code | Transactional updates keep results coherent; churn increases background work. |
| Network-mounted volumes | Latency and missed events may require manual rescan more often. |
| Unsupported extensions | Files are skipped according to product rules and never appear in search. |

If results look stale

Confirm the path is still watched, run rescan for that directory, and check that the document indexing capability is enabled. If a file type is not in the supported tables above, it will not produce chunks until support is added in a future release.