Speech-to-text providers
The mic icon in the chat prompt is a feature most users discover only when they need it. For it to work, an admin has to configure a speech-to-text (STT) provider. Until then, the icon is disabled with a "Speech provider not configured" tooltip.
Architecture
Audio is captured in the user's browser at 16 kHz mono with VAD (voice activity detection) — QRY auto-ends the recording on long pauses. The audio is streamed to the QRY backend, which forwards it to the configured STT provider, gets a transcription back, and inserts the text into the user's prompt input.
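The relay step in the middle is worth seeing concretely. Below is a minimal sketch of what that backend hop could look like, assuming the batch path and the OpenAI Whisper API; the route name and request shape are hypothetical, not QRY's actual internals.

```python
# Minimal sketch of the backend's relay step (hypothetical route; QRY's
# real internals differ). Assumes the batch path via the OpenAI Whisper
# API: receive a finished clip, transcribe it, return only the text.
import io

from fastapi import FastAPI, UploadFile
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@app.post("/api/speech/transcribe")  # hypothetical route
async def transcribe(clip: UploadFile):
    audio = io.BytesIO(await clip.read())
    audio.name = clip.filename or "clip.wav"  # SDK infers format from the name
    result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    # Only the transcription leaves this function; the audio is not persisted.
    return {"text": result.text}
```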
The backend doesn't retain audio by default — only the transcription becomes part of the conversation. If your tenant has a regulatory need to retain audio, that's a per-tenant config in Admin > System Settings > Speech > Retention.
Where to configure
Admin > System Settings > Speech. Pick a provider, fill in credentials, save.
Supported providers
The exact list depends on your tenant's deployment. Common ones:
- Google Cloud Speech-to-Text — service account JSON, supports streaming. Default for Workspace tenants.
- OpenAI Whisper API — API key, batch transcription per audio clip.
- Self-hosted Whisper — point to your own Whisper deployment via URL + token. Works offline / on-prem; see the smoke-test sketch after this list.
- Azure Speech — subscription key + region.
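For the self-hosted option, the contract is plain HTTP: QRY POSTs audio to your URL with the token you configured. Here's a minimal smoke-test sketch, assuming a `/transcribe` route, a `file` form field, and bearer-token auth, all of which are assumptions to match against whatever your Whisper server actually exposes:

```python
# Smoke-test a self-hosted Whisper deployment from outside QRY.
# The /transcribe route, field name, and response shape are assumptions;
# match them to your Whisper server's actual API.
import requests

WHISPER_URL = "https://whisper.internal.example.com/transcribe"  # hypothetical
TOKEN = "..."  # the token configured in Admin > System Settings > Speech

with open("sample.wav", "rb") as f:
    resp = requests.post(
        WHISPER_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"file": ("sample.wav", f, "audio/wav")},
        timeout=30,
    )
resp.raise_for_status()
print(resp.json())  # expect something like {"text": "..."}
```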
For each provider:
- Credentials — the per-provider auth.
- Default language — the fallback locale for transcription when the user's browser locale isn't conclusive.
- Auto-detect language — provider-side toggle (most providers support this).
- Profanity filter — provider-side toggle for masking.
Default language
If your tenant is multilingual (say, Spanish and English on the same team), enable auto-detect if the provider supports it. Otherwise pick the dominant language as the default and tell users they can switch in Settings > Voice.
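To make these settings concrete, here is a sketch of how they map onto a provider request, assuming the Google Cloud Speech-to-Text v1 Python client; the locales and filename are illustrative, and QRY composes the equivalent call for you from what you set in the admin UI.

```python
# How the admin settings map onto a provider request, sketched with the
# Google Cloud Speech-to-Text v1 client. Locales and filename are
# illustrative; QRY builds the equivalent call from your settings.
from google.cloud import speech

client = speech.SpeechClient()  # auth via GOOGLE_APPLICATION_CREDENTIALS

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,               # matches QRY's 16 kHz capture
    language_code="en-US",                 # "Default language"
    alternative_language_codes=["es-ES"],  # roughly, "Auto-detect language"
    profanity_filter=True,                 # "Profanity filter"
)
with open("clip.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```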
Latency expectations
| Provider | Typical latency for a 5-second clip |
|---|---|
| Google Cloud STT (streaming) | ~1.5s end-to-end |
| OpenAI Whisper API | ~3s |
| Self-hosted Whisper (large) | depends on GPU; 1–4s |
| Azure Speech | ~2s |
These are best-case numbers. Long clips, busy provider regions, and high backend load all push latency up.
Cost
STT providers bill per minute of audio. For a typical tenant where voice is used occasionally, costs are negligible. Where a team uses voice constantly, cost scales linearly with usage; sample your usage in the first month and budget accordingly.
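A back-of-the-envelope sketch of that budgeting step. The rate below is illustrative (OpenAI has published $0.006/min for the Whisper API); substitute your provider's actual price and your own usage sample:

```python
# Back-of-the-envelope monthly STT cost. The rate is illustrative;
# plug in your provider's real per-minute price and a usage sample
# from your first month.
RATE_PER_MIN = 0.006      # USD per minute of audio (illustrative)
USERS = 50                # active voice users (assumption)
MIN_PER_USER_PER_DAY = 5  # minutes of dictation per user per day (assumption)
WORKDAYS = 21

monthly = RATE_PER_MIN * USERS * MIN_PER_USER_PER_DAY * WORKDAYS
print(f"~${monthly:.2f}/month")  # ~$31.50/month at these numbers
```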
Disabling the feature
If your tenant doesn't want voice input at all, leave the speech provider unconfigured. The mic icon stays disabled, and the Voice input feature isn't accessible.
Common issues
Mic icon is enabled but transcription is empty or garbled. The browser captured audio, but it's silent or unclear. Confirm with the user that their mic actually works (some Macs have multiple inputs and the wrong one may be selected).
Wrong language detected. Auto-detect doesn't always nail it on short clips. Either set the user's Voice preference manually in Settings, or disable auto-detect and force a specific default.
Provider returns 401 / 403 from the backend. Credentials expired or the scope is wrong. For Google Cloud, the service account needs the speech.client role. For OpenAI, the key has to be active; note that rate limiting typically surfaces as 429 rather than 401/403.
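When credentials are the suspect, test them outside QRY first. A minimal sketch for the OpenAI case, using the official Python SDK (the key string is a placeholder):

```python
# Verify an OpenAI API key independently of QRY. A 401 here means the
# key itself is bad; if this succeeds but QRY still gets 401/403, look
# at how the key was pasted into Admin > System Settings > Speech.
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # the key configured in QRY
try:
    client.models.list()
    print("Key is valid.")
except Exception as exc:  # openai.AuthenticationError on a bad key
    print(f"Auth failed: {exc}")
```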
Latency feels longer than the table above suggests. Usually backend load (Celery workers busy with other tasks) or slow network egress from the cluster; provider-side latency is typically the smaller share.
User's browser keeps asking for mic permission. Browser permission isn't sticky for the tenant URL. The user (not the admin) controls this — they have to grant permission permanently in their browser settings.
See also
- Voice input — user-facing walkthrough.
- LLM providers — separate from STT, but similar config pattern.
- Speech-to-text reference — full feature reference.