Embedding configuration
Embeddings power QRY's RAG context (Retrieval-Augmented Generation): the system that surfaces relevant snippets from uploaded files, workspace memory, and other text-heavy artifacts at query time. The choice of embedding model affects what content is recallable, with what fidelity, and at what cost.
Two embedding types
| Type | Model | Chunk size | Use for |
|---|---|---|---|
| Text-only | gemini-embedding-001 | 2,000 chars | PDFs, Markdown, plain text, anything where layout doesn't matter |
| Multimodal | multimodalembedding@001 | 900 chars | PDFs with images, screenshots, diagrams — anything where pixels matter |
The chunk size differs because multimodal embeddings have a smaller per-vector context. A workspace using multimodal embeddings gets more, smaller chunks for the same source documents.
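The size difference can be sketched as a naive fixed-width splitter. The character limits come from the table above; the function name is illustrative, and a real pipeline would split on sentence or paragraph boundaries rather than mid-word:

```python
def chunk_text(text: str, embedding_type: str) -> list[str]:
    """Split source text into chunks sized for the embedding model.

    2,000 chars for text-only (gemini-embedding-001),
    900 for multimodal (multimodalembedding@001), per the table above.
    """
    limit = 2000 if embedding_type == "text-only" else 900
    return [text[i:i + limit] for i in range(0, len(text), limit)]

doc = "x" * 4500
print(len(chunk_text(doc, "text-only")))    # 3 chunks
print(len(chunk_text(doc, "multimodal")))   # 5 chunks
```

Same document, more and smaller chunks under multimodal: that is the trade the smaller per-vector context forces.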
You cannot mix types within one context
A workspace's RAG context is either text-only or multimodal — never both. A database constraint enforces this: cross-type similarity scores are mathematically meaningless (different vector spaces), so allowing the mix would surface garbage on retrieval.
Pick one per workspace. Most tenants default to text-only because the cost is lower and the quality is adequate for the typical workspace content.
Configuring globally
Admin > System Settings > Embeddings:
- Default embedding type — text-only or multimodal. Applies to new workspaces.
- Provider credentials — Gemini API key or service-account JSON.
- Restart policy — what happens after config changes (next section).
Configuring per-workspace
Each workspace can override the default. Workspace settings > Files > Embedding type.
Once a workspace has indexed even one file, the type is locked. Switching it would invalidate all existing embeddings — QRY doesn't auto-reindex; you'd have to delete the workspace and recreate it.
RETRIEVAL_DOCUMENT vs RETRIEVAL_QUERY
Gemini's embedding API takes a task_type parameter that affects vector orientation. QRY uses two:
- RETRIEVAL_DOCUMENT — when indexing source files. Optimised for being found.
- RETRIEVAL_QUERY — when embedding the user's chat input to retrieve against the index. Optimised for finding.
Mixing these (e.g. embedding a query as a document) drops retrieval quality silently. QRY handles this correctly internally — the gotcha is only relevant if you're customising the embedding pipeline.
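If you are customising the pipeline, the split can be kept straight with a tiny request builder. build_embed_request is a hypothetical helper, not QRY's actual code; the payload keys simply mirror the parameter names discussed here:

```python
def build_embed_request(text: str, *, indexing: bool) -> dict:
    # Documents are embedded to be found; queries are embedded to find.
    task_type = "RETRIEVAL_DOCUMENT" if indexing else "RETRIEVAL_QUERY"
    return {"contents": text, "task_type": task_type}

# Indexing a file chunk vs. embedding the user's chat input:
print(build_embed_request("chunk of an uploaded PDF", indexing=True)["task_type"])
print(build_embed_request("what does the contract say?", indexing=False)["task_type"])
```

Routing every call through one helper like this makes the document/query mix-up impossible to commit silently.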
The Gemini API param gotcha
Gemini's embedding API parameter is contents (plural), not content. Easy to miss when reading provider docs. QRY's wrapper passes contents correctly. Only relevant if you're calling the API directly.
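A hypothetical guard makes the mix-up fail loudly before the request goes out; QRY's wrapper already enforces this, so the sketch is only for direct API callers:

```python
def validate_embed_payload(payload: dict) -> dict:
    # Catch the singular/plural key mix-up early instead of at the API.
    if "contents" not in payload:
        hint = " (did you mean 'contents'?)" if "content" in payload else ""
        raise KeyError("missing 'contents'" + hint)
    return payload

print(validate_embed_payload({"contents": "hello"}))
```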
Restart workers after config changes
This is the operational gotcha. The embedding service is loaded once per Celery worker on startup. Config changes (provider, model, chunk size, task_type) do not take effect until the worker restarts.
After making config changes:
```shell
kubectl rollout restart deployment/celery-worker -n qry-app
```
Or, to let in-flight work drain gracefully:

```shell
kubectl scale deployment/celery-worker --replicas=0 -n qry-app
# wait for pods to drain
kubectl scale deployment/celery-worker --replicas=N -n qry-app  # N = your normal count
```
If you forget the restart, you'll see the old behaviour and assume the config didn't save. It saved; the worker hasn't picked it up.
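The failure mode is the load-once pattern itself, sketched below. The config path and keys are assumptions, not QRY's actual module layout:

```python
import json
import pathlib
import tempfile

# Illustrative config file standing in for QRY's embedding settings.
cfg_path = pathlib.Path(tempfile.mkdtemp()) / "embedding_config.json"
cfg_path.write_text(json.dumps({"model": "gemini-embedding-001"}))

# Read once, at what would be Celery worker startup (module import):
EMBEDDING_CONFIG = json.loads(cfg_path.read_text())

# An admin changes the setting while the worker is running...
cfg_path.write_text(json.dumps({"model": "multimodalembedding@001"}))

# ...but the worker still holds its startup snapshot until restarted:
print(EMBEDDING_CONFIG["model"])   # gemini-embedding-001
```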
Health check: embeddings should never be zero vectors
A zero vector from the embedding API means the API errored silently and QRY didn't catch it. The retrieval system collapses (every document looks identical to every query). After config changes, sanity-check by uploading one test file and confirming non-zero vectors:
```sql
SELECT length(embedding), embedding[1:5]
FROM file_embeddings
ORDER BY indexed_at DESC LIMIT 5;
```
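If you pull vectors out in application code instead, the same sanity check is a one-liner; is_zero_vector is an illustrative helper, not part of QRY:

```python
def is_zero_vector(vec: list[float], eps: float = 1e-12) -> bool:
    # A zero vector makes its document equidistant from every query.
    return all(abs(x) < eps for x in vec)

print(is_zero_vector([0.0, 0.0, 0.0]))     # True: silent API failure
print(is_zero_vector([0.01, -0.2, 0.5]))   # False: healthy embedding
```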
Costs
Both embedding models charge per 1k tokens of input. Multimodal is more expensive per call. For a tenant with heavy file ingestion, expect embedding costs to be a meaningful line item — sample your usage and forecast before scaling.
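A back-of-envelope forecast is straightforward once you have sampled monthly token volume. The per-1k-token rates below are placeholders, not Gemini's actual pricing; substitute current rates before relying on the numbers:

```python
# Assumed rates in USD per 1k input tokens, for illustration only.
PRICE_PER_1K_TOKENS = {"text-only": 0.0001, "multimodal": 0.0002}

def monthly_embedding_cost(tokens_per_month: int, embedding_type: str) -> float:
    return tokens_per_month / 1000 * PRICE_PER_1K_TOKENS[embedding_type]

# e.g. 500M tokens/month of ingested text at the assumed text-only rate:
print(f"${monthly_embedding_cost(500_000_000, 'text-only'):.2f}")   # $50.00
```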
Common issues
Files marked "Indexing" forever. Either the embedding worker is stuck (check Celery health) or the API is returning errors silently. Look at worker logs for embedding error patterns.
Search recall feels worse after I changed the model. You changed the model but didn't reindex existing files. New uploads are searched with the new model; old uploads still have the old model's vectors. Cross-model similarity is poor — same problem as cross-type.
Test query returns identical content for every prompt. This is the zero-vector failure mode. Check that the embedding API is actually responding with non-zero floats. Restart workers; if it persists, check API quotas and credentials.
Workspace switched from text-only to multimodal but no images are recallable. The switch only affects new uploads; existing documents keep their text-only embeddings. Delete and re-upload the files you need embedded multimodally. Note that on an already-indexed workspace the no-mix database constraint should have blocked the switch in the first place.
Indexing speed degrades over time. The embedding API has rate limits. QRY batches requests and backs off on rate-limit errors, but at scale you'll hit a ceiling. Either bump the API quota with Google or shard across multiple Gemini projects.
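The batch-and-back-off behaviour above can be sketched as exponential backoff with jitter. This is an illustration of the pattern, not QRY's actual code; here `call` stands in for one embedding request and raises RuntimeError on a rate-limit response:

```python
import random
import time

def embed_with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry an embedding call, doubling the delay after each
    rate-limit error and adding a little jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Simulate an API that rate-limits twice, then succeeds:
attempts = {"n": 0}
def flaky_embed():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("429: rate limited")
    return [0.1, 0.2, 0.3]

vec = embed_with_backoff(flaky_embed, base_delay=0.0)
print(attempts["n"])   # 3: two rate-limited attempts, then success
```

Backoff only smooths over transient limits; a sustained ceiling still needs a quota bump or sharding, as noted above.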
See also
- LLM providers — chat / reasoning models, separate from embeddings.
- Shared files and memory — what users see from the embedding pipeline.
- Dual Embedding System reference — full feature reference.