Document indexing
Watch local folders, extract and chunk text from many file types, and search documents alongside screen and audio captures with keyword and semantic retrieval.
Last updated: 2 April 2026
Purpose and scope
Document indexing brings files you already keep on disk into the same local search surface as screen OCR and audio transcripts. Instead of treating captures and documents as separate silos, you can query once and see relevant passages from PDFs, spreadsheets, notebooks, and source code next to moments from your display and microphone.
This is useful when your authoritative text lives in a project tree, a readings folder, or shared documentation that you are allowed to index locally. Overshow watches paths you choose, extracts text where supported, chunks it for retrieval, and keeps the index aligned as files appear, change, or disappear.
Document indexing is gated by a feature flag in shipping builds. If you do not see directory watching or the Documents settings tab, your build or organisation policy may have this capability disabled.
Supported file formats
The indexer recognises a broad set of formats. For plain-text extensions, a binary guard and UTF-8 validation reject binary or non-text payloads, so corrupted or mislabelled files do not pollute the index.
Documents and data
| Category | Formats | Extraction approach |
|---|---|---|
| PDF | .pdf | On-device text extraction |
| Word | .docx | Structured document text extraction |
| Spreadsheets | Excel family, ODS | Tabular content extraction |
Code and plain text
| Category | Extensions |
|---|---|
| Documentation and config | .txt, .md, .yaml, .yml, .json, .toml, .xml |
| Systems and application code | .rs, .py, .js, .ts, .go, .java, .c, .cpp, .h, .cs, .rb, .php, .swift, .kt, .scala, .sh |
Why so many text extensions?
Knowledge work rarely lives in a single format. Treating source, configuration, and prose under one watcher lets you search “how we configured retries” or “that enum variant name” without remembering whether it lived in Rust, YAML, or a README. The binary guard still applies: a file with a .txt extension whose content is mostly binary will not be indexed as readable text.
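The guard described above can be pictured with a minimal sketch. It assumes a NUL-byte check on a leading sample followed by strict UTF-8 decoding; the function name, sample size, and heuristic are illustrative assumptions, not Overshow's actual implementation.

```python
# Hypothetical sample size; the real guard's heuristics are not documented here.
SAMPLE_BYTES = 8192


def looks_like_text(data: bytes) -> bool:
    """Return True if the bytes pass a simple binary guard and UTF-8 validation."""
    sample = data[:SAMPLE_BYTES]
    # Binary guard (assumed heuristic): NUL bytes almost never occur in real text files.
    if b"\x00" in sample:
        return False
    # UTF-8 validation: the whole payload must decode without errors.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Under these assumptions, a PNG renamed to notes.txt fails the NUL check, and a Latin-1 file with stray high bytes fails decoding, so neither would reach the extraction step.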
How indexing works
Indexing is a pipeline from disk to searchable chunks. The steps below run on your machine; no document content is sent to external search or embedding APIs for this feature.
Step-by-step pipeline
| Step | What happens |
|---|---|
| 1. Discovery | The watcher observes configured directories and detects new, modified, or removed files matching supported types. |
| 2. Validation | Plain-text paths are checked for binary content and valid UTF-8 before parsing. |
| 3. Extraction | Format-specific extractors pull text from PDF, DOCX, spreadsheets, or read plain files directly. |
| 4. Chunking | Text is split into paragraph-aware chunks sized for retrieval, with overlap between adjacent chunks so that text near a boundary keeps its surrounding context when embeddings are computed. |
| 5. Indexing | Chunks feed keyword search and a semantic index built from on-device embeddings. Full-text entries use redacted, tokenised representations rather than storing raw sensitive plaintext in the search surface. |
| 6. Encryption | Chunk payloads are encrypted at rest in line with Overshow’s local storage model. |
| 7. Updates | When a file changes, the system applies transactional document updates: chunk rows, FTS rows, and background queue work are replaced atomically so you never see half-updated documents in search results. |
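To make step 4 concrete, here is a minimal sketch of paragraph-aware chunking with overlap. The size limit, the one-paragraph overlap, and the function name are assumptions chosen for illustration; the actual chunk sizes and overlap strategy are not specified in this document.

```python
def chunk_paragraphs(text: str, max_chars: int = 1000, overlap_paras: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap_paras` paragraphs of each chunk are repeated at the
    start of the next one. A single paragraph longer than max_chars is
    kept whole rather than split mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the trailing paragraphs forward as overlap.
            current = current[-overlap_paras:] if overlap_paras > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With three 40-character paragraphs A, B, C and `max_chars=90`, this yields two chunks, "A, B" and "B, C", so a query that matches text near the boundary can hit either chunk.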
Stale chunks and background work
If a chunk is removed after a job was enqueued but before processing completes, the pipeline handles stale references gracefully. Work items that no longer point at live chunks are dropped safely instead of failing the whole queue.
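The idea of dropping stale work items can be sketched as follows. The queue of chunk IDs, the `live_chunks` lookup, and the function name are hypothetical stand-ins for whatever structures the pipeline really uses.

```python
from collections import deque


def drain_embedding_queue(
    queue: deque[int], live_chunks: dict[int, str]
) -> tuple[list[int], list[int]]:
    """Process queued chunk IDs, skipping any whose chunk no longer exists."""
    processed: list[int] = []
    dropped: list[int] = []
    while queue:
        chunk_id = queue.popleft()
        text = live_chunks.get(chunk_id)
        if text is None:
            # Stale reference: the chunk was removed after the job was enqueued.
            dropped.append(chunk_id)
            continue
        # Placeholder: a real worker would compute an embedding for `text` here.
        processed.append(chunk_id)
    return processed, dropped
```

The key design point is that a missing chunk is treated as a normal outcome rather than an error, so one deleted file cannot stall the rest of the queue.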
Keyword versus semantic search on documents
| Mode | Role for documents |
|---|---|
| Keyword | Fast matching on terms, names, and exact phrases surfaced via the tokenised index. |
| Semantic (embeddings) | Finds chunks whose meaning matches your query when wording differs from the source. |
| Hybrid | Combines both signals so document hits rank alongside screen and audio results in a single list. |
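This document does not say which fusion method the hybrid mode uses. As one common way to combine a keyword ranking with a semantic ranking, the sketch here uses reciprocal rank fusion (RRF); the result IDs and the constant `k = 60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of result IDs into a single ranking.

    Each result scores 1 / (k + rank) per list it appears in; scores are
    summed and results are sorted by descending total (ties broken by ID).
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, result_id in enumerate(ranking, start=1):
            scores[result_id] = scores.get(result_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda r: (-scores[r], r))


# Hypothetical IDs mixing document, screen, and audio hits.
keyword_hits = ["doc:spec.md#2", "screen:101", "audio:7"]
semantic_hits = ["audio:7", "doc:spec.md#2", "doc:notes.md#1"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Results that appear in both lists, such as the spec passage and the audio segment here, rise above results found by only one signal, which matches the intent of ranking document hits alongside captures.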
Embedding generation and background work
After chunks are written, embedding jobs are enqueued for semantic indexing. Generation uses the same on-device embedding model as other semantic features in Overshow. Large directories therefore produce bursts of background CPU or GPU work; the app is designed to process the queue without blocking the search UI, but you may notice activity during initial indexing or after a full rescan.
Full-text index representation
The full-text layer supports fast keyword retrieval, but it does not mirror your files verbatim in the index store. Redacted, tokenised forms reduce exposure of raw sensitive strings in the full-text structure while still allowing stemmed and phrase-style matching. Your chunk payloads remain protected separately through encryption at rest, consistent with the rest of the desktop datastore.
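How the redacted, tokenised form is produced is not described here. One way such an index could work, shown purely as an assumption, is keyed token hashing (sometimes called a blind index): each word is replaced by an HMAC digest, so the index holds no readable words yet equal tokens still match in order. Stemming is omitted from this sketch, and every name in it is hypothetical.

```python
import hashlib
import hmac
import re


def blind_tokens(text: str, key: bytes) -> list[str]:
    """Lower-case the text, split it into words, and replace each word with a keyed hash."""
    words = re.findall(r"\w+", text.lower())
    return [
        hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
        for word in words
    ]


def phrase_match(indexed: list[str], query: list[str]) -> bool:
    """Return True if the query tokens occur contiguously, in order, in the indexed tokens."""
    n = len(query)
    return n > 0 and any(
        indexed[i : i + n] == query for i in range(len(indexed) - n + 1)
    )
```

Queries are hashed with the same key before lookup, so "retry budget" matches a chunk containing that phrase while the stored index entries remain opaque digests.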
Documents compared with captures
| Source | Typical content | Index shape |
|---|---|---|
| Screen OCR | Ephemeral UI text, stack traces, chat | Time-ordered capture segments |
| Audio | Spoken words via local STT | Transcript segments with timing |
| Documents | Authoritative files on disk | Versioned chunks per file path |
Unified search does not flatten these into one blob; it ranks across parallel indexes so the right modality can win per query.
Adding and managing watched directories
Onboarding
Step 4 of the onboarding wizard offers a native directory picker so you can grant one or more roots without typing paths by hand. You can skip this step and add folders later from settings if you prefer to explore the app first.
Settings
Open Settings, then the Documents tab to:
- Add or remove watched directories
- Review which roots are active
- Understand scope before indexing large trees
Prefer stable project or library roots over entire home directories unless you genuinely need that breadth. Narrower watches reduce background work and make it easier to reason about what is searchable.
Rescan on demand
Each watched directory can be rescanned on demand. Use rescan after bulk copies, git checkouts, or network drive syncs where file system events might have been missed. Rescan reconciles the index with the current filesystem state under that path.
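Reconciliation during a rescan can be thought of as a diff between what is on disk and what the index last saw. The sketch assumes modification times are the change signal and uses a small illustrative subset of extensions; both are assumptions, as are the function names.

```python
from pathlib import Path

# Illustrative subset only; see the supported format tables for the real list.
SUPPORTED = {".md", ".txt", ".rs", ".py"}


def snapshot(root: Path) -> dict[str, float]:
    """Map each supported file under root to its modification time."""
    return {
        str(p): p.stat().st_mtime
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    }


def plan_rescan(on_disk: dict[str, float], indexed: dict[str, float]) -> dict[str, list[str]]:
    """Compare two path-to-mtime maps and list what must be added, re-indexed, or removed."""
    added = sorted(p for p in on_disk if p not in indexed)
    removed = sorted(p for p in indexed if p not in on_disk)
    changed = sorted(p for p in on_disk if p in indexed and on_disk[p] != indexed[p])
    return {"added": added, "changed": changed, "removed": removed}
```

A call such as `plan_rescan(snapshot(Path("~/projects/app").expanduser()), indexed_state)` would then drive the same add, update, and delete paths that file system events normally trigger.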
Search experience: documents alongside captures
Visual distinction
In the desktop search UI, document results use a blue FileText-style icon and a document badge so you can distinguish them at a glance from screen captures and audio-derived hits. This keeps mixed result lists scannable without opening every item.
Content type filtering
Use content type filters (where available) to narrow to documents only, or to combine documents with screen and audio as needed. The same query text runs across whichever types you include.
Unified retrieval
Unified search means one query can return:
- OCR text from screen history
- Transcript segments from audio
- Passages from indexed files under watched directories
Ranking and hybrid fusion apply across these sources so the most relevant chunk wins regardless of origin.
When files change or are deleted
| Event | Typical behaviour |
|---|---|
| File edited | Chunks for that document are replaced atomically; FTS and embedding queues stay consistent. |
| File deleted | Associated chunks are removed; future searches no longer surface that file. |
| Temporary inconsistency | Transactional updates and stale-chunk handling avoid leaving orphan hits in the UI. |
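The atomic replacement in the table above can be illustrated with a single database transaction. This sketch uses Python's built-in `sqlite3` with plain tables standing in for the chunk store, the full-text rows, and the embedding queue; the schema and function names are hypothetical and say nothing about Overshow's real storage layout.

```python
import sqlite3


def open_index() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical three-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE chunks (id INTEGER PRIMARY KEY, path TEXT, body TEXT);
        CREATE TABLE fts_rows (chunk_id INTEGER, tokens TEXT);
        CREATE TABLE embed_queue (chunk_id INTEGER);
        """
    )
    return conn


def replace_document(conn: sqlite3.Connection, path: str, new_chunks: list[str]) -> None:
    """Swap all rows for one file inside a single transaction."""
    with conn:  # Commits on success, rolls back if any statement raises.
        old_ids = [row[0] for row in conn.execute("SELECT id FROM chunks WHERE path = ?", (path,))]
        for chunk_id in old_ids:
            conn.execute("DELETE FROM fts_rows WHERE chunk_id = ?", (chunk_id,))
            conn.execute("DELETE FROM embed_queue WHERE chunk_id = ?", (chunk_id,))
        conn.execute("DELETE FROM chunks WHERE path = ?", (path,))
        for body in new_chunks:
            cur = conn.execute("INSERT INTO chunks (path, body) VALUES (?, ?)", (path, body))
            # Placeholder tokens; a real index would store a redacted, tokenised form.
            conn.execute("INSERT INTO fts_rows VALUES (?, ?)", (cur.lastrowid, body.lower()))
            conn.execute("INSERT INTO embed_queue VALUES (?)", (cur.lastrowid,))
```

Because the deletes and inserts commit together, a search running concurrently sees either the old chunks or the new ones, never a mixture, and a failure partway through leaves the previous version intact.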
Removing a directory from the watch list stops new indexing for that path. Whether previously indexed material is purged immediately depends on product behaviour in your version; if you need certainty, consult release notes or support for retention semantics.
Security and privacy notes
- At rest: Document chunk content is stored encrypted at rest with the same local encryption posture as other sensitive indexed material.
- FTS representation: Keyword search uses redacted, tokenised index entries, not a verbatim dump of sensitive plaintext into the full-text store.
- On device: Extraction, chunking, embeddings, and search run locally; you are not shipping file contents to a cloud document index for this feature.
Watching a folder is an explicit decision to make its supported files searchable on this device. Material under legal hold, client confidentiality regimes, or export-controlled paths should stay outside watched lists unless your organisation’s policy and contractual licence terms permit local indexing.
Tips for organising watched directories
| Practice | Rationale |
|---|---|
| One logical root per project | Easier rescan, clearer mental model of what is indexed. |
| Exclude archives you rarely search | Reduces noise and background embedding work. |
| Separate “reference” and “active work” | You can rescan reference libraries less often than fast-moving repos. |
| Align with policy | Only watch folders your organisation permits to sit in a local searchable index. |
| Rescan after large imports | Ensures batch operations are fully reflected when events were batched or delayed. |
Pairing with screen and audio search
When a decision was both discussed in a meeting (audio) and written in a spec (document), hybrid search across all content types often surfaces both. Filter to documents when you know the answer is in a file; widen to all types when you are unsure.
Operational expectations
| Situation | What to expect |
|---|---|
| First watch on a large tree | A longer initial indexing phase; embeddings fill in progressively. |
| Frequent saves in code | Transactional updates keep results coherent; churn increases background work. |
| Network-mounted volumes | Latency and missed events may require manual rescan more often. |
| Unsupported extensions | Files are ignored silently or skipped according to product rules; they never appear in search. |
If results look stale
Confirm the path is still watched, run rescan for that directory, and check that the document indexing capability is enabled. If a file type is not in the supported tables above, it will not produce chunks until support is added in a future release.