
Document indexing

Watch local folders, extract and chunk text from many file types, and search documents alongside screen and audio captures with keyword and semantic retrieval.

Last updated: 2 April 2026

Purpose and scope

Document indexing brings files you already keep on disk into the same local search surface as screen OCR and audio transcripts. Instead of treating captures and documents as separate silos, you can query once and see relevant passages from PDFs, spreadsheets, notebooks, and source code next to moments from your display and microphone.

This is useful when your authoritative text lives in a project tree, a readings folder, or shared documentation that you are allowed to index locally. Overshow watches paths you choose, extracts text where supported, chunks it for retrieval, and keeps the index aligned as files appear, change, or disappear.

Document indexing is gated by a feature flag in shipping builds. If you do not see directory watching or the Documents settings tab, your build or organisation policy may have this capability disabled.

Supported file formats

The indexer recognises a broad set of formats. For plain-text extensions, a binary guard and UTF-8 validation reject binary or mislabelled payloads, so corrupted files do not pollute the index.
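The guard's exact heuristics are not published; a minimal sketch of the idea, assuming a NUL-byte scan plus a strict UTF-8 decode, might look like this:

```python
def looks_like_text(payload: bytes) -> bool:
    """Illustrative binary guard: reject payloads that contain NUL bytes
    or are not valid UTF-8. (Real text files almost never contain NULs;
    most binary formats do.)"""
    if b"\x00" in payload:
        return False
    try:
        payload.decode("utf-8")  # must decode cleanly to index as plain text
    except UnicodeDecodeError:
        return False
    return True
```

A file named `notes.txt` that fails this check would be skipped rather than indexed as readable text.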

Documents and data

| Category | Formats | Extraction approach |
| --- | --- | --- |
| PDF | .pdf | On-device text extraction |
| Word | .docx | Structured document text extraction |
| Spreadsheets | Excel family, ODS | Tabular content extraction |

Code and plain text

| Category | Extensions |
| --- | --- |
| Documentation and config | .txt, .md, .yaml, .yml, .json, .toml, .xml |
| Systems and application code | .rs, .py, .js, .ts, .go, .java, .c, .cpp, .h, .cs, .rb, .php, .swift, .kt, .scala, .sh |
Why so many text extensions?

Knowledge work rarely lives in a single format. Treating source, configuration, and prose under one watcher lets you search “how we configured retries” or “that enum variant name” without remembering whether it lived in Rust, YAML, or a README. The binary guard still applies: a file named .txt that is mostly binary will not be indexed as readable text.

How indexing works

Indexing is a pipeline from disk to searchable chunks. The steps below run on your machine; no document content is sent to external search or embedding APIs for this feature.

Step-by-step pipeline

| Step | What happens |
| --- | --- |
| 1. Discovery | The watcher observes configured directories and detects new, modified, or removed files matching supported types. |
| 2. Validation | Plain-text paths are checked for binary content and valid UTF-8 before parsing. |
| 3. Extraction | Format-specific extractors pull text from PDF, DOCX, and spreadsheets, or read plain files directly. |
| 4. Chunking | Text is split into paragraph-aware chunks sized for retrieval, with overlap between neighbouring chunks so passages near a boundary are still embedded with surrounding context. |
| 5. Indexing | Chunks feed keyword search and a semantic index built from on-device embeddings. Full-text entries use redacted, tokenised representations rather than storing raw sensitive plaintext in the search surface. |
| 6. Encryption | Chunk payloads are encrypted at rest in line with Overshow's local storage model. |
| 7. Updates | When a file changes, the system applies transactional document updates: chunk rows, FTS rows, and background queue work are replaced atomically, so you never see half-updated documents in search results. |
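The chunking step can be sketched as follows. Overshow's actual chunk sizes and overlap are not published, so `max_chars` and `overlap` here are illustrative assumptions, and a real implementation would also split paragraphs that exceed the limit on their own:

```python
def chunk_text(text: str, max_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Illustrative paragraph-aware chunker: pack whole paragraphs up to
    max_chars, then start the next chunk with a tail of the previous one
    so content near a boundary keeps surrounding context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry overlap into the next chunk
        current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk after the first begins with the last `overlap` characters of its predecessor, which is what lets embeddings computed per chunk still "see" across boundaries.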
Stale chunks and background work

If a chunk is removed after a job was enqueued but before processing completes, the pipeline handles stale references gracefully. Work items that no longer point at live chunks are dropped safely instead of failing the whole queue.
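A queue worker that tolerates stale references might look like the sketch below; the job and chunk shapes are assumptions for illustration, not Overshow's internal types:

```python
def process_queue(jobs, live_chunks, embed):
    """Illustrative embedding worker: jobs whose chunk was deleted after
    enqueue are dropped rather than failing the whole queue."""
    completed, dropped = [], []
    for chunk_id in jobs:
        text = live_chunks.get(chunk_id)
        if text is None:  # chunk removed since the job was enqueued: stale
            dropped.append(chunk_id)
            continue
        completed.append((chunk_id, embed(text)))
    return completed, dropped
```

The key property is that a stale job is a no-op, not an error, so one deleted file cannot stall embedding work for everything behind it in the queue.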

Keyword versus semantic search on documents

| Mode | Role for documents |
| --- | --- |
| Keyword | Fast matching on terms, names, and exact phrases surfaced via the tokenised index. |
| Semantic (embeddings) | Finds chunks whose meaning matches your query when wording differs from the source. |
| Hybrid | Combines both signals so document hits rank alongside screen and audio results in a single list. |
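One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion; whether Overshow uses RRF specifically is not stated, so treat this as an illustrative sketch of hybrid fusion in general:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Illustrative reciprocal rank fusion: each list contributes
    1 / (k + rank) per result, and items are sorted by total score.
    k = 60 is the conventional default from the RRF literature."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, the two modes' raw scores never need to be on a comparable scale, which is why fusion of this style works across keyword, semantic, and capture-derived result lists.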

Embedding generation and background work

After chunks are written, embedding jobs are enqueued for semantic indexing. Generation uses the same on-device embedding model as other semantic features in Overshow. Large directories therefore produce bursts of background CPU or GPU work; the app is designed to process the queue without blocking the search UI, but you may notice activity during initial indexing or after a full rescan.

Full-text index representation

The full-text layer supports fast keyword retrieval, but it does not mirror your files verbatim in the index store. Redacted, tokenised forms reduce exposure of raw sensitive strings in the full-text structure while still allowing stemmed and phrase-style matching. Your chunk payloads remain protected separately through encryption at rest, consistent with the rest of the desktop datastore.
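The exact redaction rules are not documented. As a purely hypothetical illustration, a tokeniser might mask sensitive-looking strings with placeholders before emitting terms to the full-text index; the e-mail and digit-run patterns below are assumptions, not Overshow's actual rules:

```python
import re

# Hypothetical redaction patterns: mask before tokenising for FTS.
SENSITIVE = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),  # e-mail addresses
    (re.compile(r"\b\d{6,}\b"), "<number>"),              # long digit runs
]

def fts_tokens(text: str) -> list[str]:
    """Illustrative redacted tokenisation: substitute placeholders for
    sensitive-looking spans, then emit lowercase word tokens."""
    for pattern, placeholder in SENSITIVE:
        text = pattern.sub(placeholder, text)
    return re.findall(r"<\w+>|\w+", text.lower())
```

The point is that keyword search still matches surrounding words and phrases, while the raw sensitive string never enters the full-text structure verbatim.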

Documents compared with captures

| Source | Typical content | Index shape |
| --- | --- | --- |
| Screen OCR | Ephemeral UI text, stack traces, chat | Time-ordered capture segments |
| Audio | Spoken words via local STT | Transcript segments with timing |
| Documents | Authoritative files on disk | Versioned chunks per file path |

Unified search does not flatten these into one blob; it ranks across parallel indexes so the right modality can win per query.

Adding and managing watched directories

Onboarding

During the onboarding wizard, step 4 offers a native directory picker so you can grant one or more roots without typing paths by hand. You can skip this step and add folders later from settings if you prefer to explore the app first.

Settings

Open Settings, then the Documents tab to:

  • Add or remove watched directories
  • Review which roots are active
  • Understand scope before indexing large trees

Prefer stable project or library roots over entire home directories unless you genuinely need that breadth. Narrower watches reduce background work and make it easier to reason about what is searchable.

Rescan on demand

Each watched directory can be rescanned on demand. Use rescan after bulk copies, git checkouts, or network drive syncs where file system events might have been missed. Rescan reconciles the index with the current filesystem state under that path.
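Reconciliation during a rescan amounts to diffing the index against the current filesystem. A minimal sketch, assuming state is tracked as path-to-mtime maps (an assumption about Overshow's internals):

```python
def plan_rescan(indexed: dict[str, float], on_disk: dict[str, float]):
    """Illustrative rescan planner: compare indexed state (path -> mtime)
    with the filesystem and return paths to add, re-index, and remove."""
    to_add    = sorted(p for p in on_disk if p not in indexed)
    to_update = sorted(p for p in on_disk
                       if p in indexed and on_disk[p] != indexed[p])
    to_remove = sorted(p for p in indexed if p not in on_disk)
    return to_add, to_update, to_remove
```

This is why rescan recovers from missed watcher events: it does not replay events, it compares end states.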

Search experience: documents alongside captures

Visual distinction

In the desktop search UI, document results use a blue FileText-style icon and a document badge so you can distinguish them at a glance from screen captures and audio-derived hits. This keeps mixed result lists scannable without opening every item.

Content type filtering

Use content type filters (where available) to narrow to documents only, or to combine documents with screen and audio as needed. The same query text runs across whichever types you include.

Unified retrieval

Unified search means one query can return:

  • OCR text from screen history
  • Transcript segments from audio
  • Passages from indexed files under watched directories

Ranking and hybrid fusion apply across these sources so the most relevant chunk wins regardless of origin.

When files change or are deleted

| Event | Typical behaviour |
| --- | --- |
| File edited | Chunks for that document are replaced atomically; FTS and embedding queues stay consistent. |
| File deleted | Associated chunks are removed; future searches no longer surface that file. |
| Temporary inconsistency | Transactional updates and stale-chunk handling avoid leaving orphan hits in the UI. |
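The atomic-replacement behaviour can be illustrated with a SQLite transaction; the `chunks` table and its columns here are hypothetical, not Overshow's actual schema:

```python
import sqlite3

def replace_document(conn: sqlite3.Connection, path: str,
                     chunks: list[str]) -> None:
    """Illustrative transactional update: delete a document's old chunk
    rows and insert the new ones inside a single transaction, so readers
    never observe a half-updated document."""
    with conn:  # commits on success, rolls back on any exception
        conn.execute("DELETE FROM chunks WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO chunks (path, seq, body) VALUES (?, ?, ?)",
            [(path, i, body) for i, body in enumerate(chunks)],
        )
```

If the insert fails midway, the delete is rolled back too, so the previous version of the document stays searchable intact.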

Removing a directory from the watch list stops new indexing for that path. Whether previously indexed material is purged immediately depends on product behaviour in your version; if you need certainty, consult release notes or support for retention semantics.

Security and privacy notes

  • At rest: Document chunk content is stored encrypted at rest with the same local encryption posture as other sensitive indexed material.
  • FTS representation: Keyword search uses redacted, tokenised index entries, not a verbatim dump of sensitive plaintext into the full-text store.
  • On device: Extraction, chunking, embeddings, and search run locally; you are not shipping file contents to a cloud document index for this feature.

Watching a folder is an explicit decision to make its supported files searchable on this device. Material under legal hold, client confidentiality regimes, or export-controlled paths should stay outside watched lists unless your organisation’s policy and contractual licence terms permit local indexing.

Tips for organising watched directories

| Practice | Rationale |
| --- | --- |
| One logical root per project | Easier rescan, clearer mental model of what is indexed. |
| Exclude archives you rarely search | Reduces noise and background embedding work. |
| Separate "reference" and "active work" | You can rescan reference libraries less often than fast-moving repos. |
| Align with policy | Only watch folders your organisation permits to sit in a local searchable index. |
| Rescan after large imports | Ensures batch operations are fully reflected when events were batched or delayed. |

Pairing with screen and audio search

When a decision was both discussed in a meeting (audio) and written in a spec (document), hybrid search across all content types often surfaces both. Filter to documents when you know the answer is in a file; widen to all types when you are unsure.

Operational expectations

| Situation | What to expect |
| --- | --- |
| First watch on a large tree | A longer initial indexing phase; embeddings fill in progressively. |
| Frequent saves in code | Transactional updates keep results coherent; churn increases background work. |
| Network-mounted volumes | Latency and missed events may require manual rescan more often. |
| Unsupported extensions | Files are skipped according to product rules and never appear in search. |

If results look stale

Confirm the path is still watched, run rescan for that directory, and check that the document indexing capability is enabled. If a file type is not in the supported tables above, it will not produce chunks until support is added in a future release.