# LLM providers
QRY supports multiple LLM providers in parallel. Each conversation, notebook cell, dashboard tile, scheduled task, and Forge translation can target a different provider / model. This page covers the admin side: registering providers, mapping models to providers, and the operational concerns.
## Supported providers
- Anthropic Claude — recommended default for chat and reasoning.
- Google Gemini — strong multimodal support; Gemini-native Python executor for fast / cheap code execution.
- OpenAI — GPT-4o, GPT-4 Turbo.
You don't have to pick one — register all three and let users (or per-feature config) decide.
## Configuring a provider
In Admin > System Settings > LLM Providers, click + Add Provider for each.
### Anthropic Claude
- API key — from the Anthropic console.
- Available models — `claude-haiku-*`, `claude-sonnet-*`, `claude-opus-*`. Tick the ones you want exposed to users.
- Default rate limit — Anthropic's per-org cap; QRY respects it via exponential backoff on 429s.
### Google Gemini
- API key OR service-account JSON — service account is what Workspace tenants typically have.
- Available models — `gemini-flash-*`, `gemini-pro-*`. Same tick-list pattern.
- Project + location — for Vertex AI deployments instead of the public Gemini API.
### OpenAI
- API key — from the OpenAI dashboard.
- Organization id — optional, used for billing attribution.
- Available models — `gpt-4o`, `gpt-4-turbo`, etc.
Click Save. Each provider is tested with a tiny inference call to catch credential / connectivity issues immediately.
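The "tiny inference call" check can be sketched as follows. This is an illustration, not QRY's actual implementation: the `validate_provider` helper and the injected `tiny_call` callable are hypothetical names, standing in for whatever one-token request each provider SDK offers (e.g. a one-max-token message for Claude).

```python
from typing import Callable

def validate_provider(name: str, tiny_call: Callable[[], str]) -> tuple[bool, str]:
    """Run a minimal inference call and report success or the failure reason.

    `tiny_call` is injected so the check is provider-agnostic: any auth,
    DNS, or quota problem surfaces as an exception from the SDK.
    """
    try:
        tiny_call()
        return True, f"{name}: credentials and connectivity OK"
    except Exception as exc:  # auth errors, connectivity failures, quota issues
        return False, f"{name}: {type(exc).__name__}: {exc}"

# Example: a provider whose key is rejected.
def bad_call() -> str:
    raise PermissionError("invalid x-api-key")

ok, msg = validate_provider("anthropic", bad_call)
# ok is False; msg contains "invalid x-api-key"
```

Running the check synchronously at save time is what lets the admin UI surface a bad key immediately rather than at first user request.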
## Per-feature routing
Different features have different requirements. Admin > System Settings > LLM Providers > Routing:
| Feature | Recommended default |
|---|---|
| Chat | Claude Sonnet |
| Notebook cells | Claude Haiku for routine, Sonnet for synthesis (per-cell choice) |
| Dashboard AI assistant | Gemini Pro (multimodal helps with screenshot prompts) |
| Forge translation | Claude Sonnet (best dialect translation) |
| Domain agents | Claude (native tool support) |
| ML training planning | Any |
| Scheduled task summary | Cheapest available |
Each row in the routing config has: feature, primary provider, model, fallback provider, fallback model. Fallback fires when the primary returns a 5xx or rate-limit error.
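The routing rows above amount to a small lookup table plus a try-primary-then-fallback call path. A minimal sketch of that shape (the `Route` dataclass, `ROUTING` table, and `ProviderError` are illustrative names, not QRY's internal API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    primary_provider: str
    primary_model: str
    fallback_provider: str
    fallback_model: str

# One row per feature, mirroring the admin routing config.
ROUTING = {
    "chat": Route("anthropic", "claude-sonnet", "google", "gemini-pro"),
    "scheduled_task_summary": Route("anthropic", "claude-haiku", "google", "gemini-flash"),
}

class ProviderError(Exception):
    """Stands in for a 5xx / rate-limit response from a provider."""

def complete(feature: str, prompt: str, call: Callable[[str, str, str], str]) -> str:
    """Try the feature's primary route; on a provider error, fire the fallback."""
    route = ROUTING[feature]
    try:
        return call(route.primary_provider, route.primary_model, prompt)
    except ProviderError:
        return call(route.fallback_provider, route.fallback_model, prompt)

def fake_call(provider: str, model: str, prompt: str) -> str:
    if provider == "anthropic":
        raise ProviderError("529 overloaded")
    return f"{provider}/{model}: answer"

# A Claude outage routes chat to its Gemini fallback:
print(complete("chat", "hi", fake_call))  # google/gemini-pro: answer
```

Because the table is keyed per feature, an outage on one provider only degrades the features routed to it, which is the behaviour described under Resilience below.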
## Resilience
QRY handles provider outages with three mechanisms:
### Exponential backoff on 429 / 503
The LLM service retries with exponential backoff and jitter. Anthropic in particular sometimes returns 529 (overloaded); that status follows the same retry path.
### Provider fallback
If the primary provider is down, the configured fallback takes over. Per-feature, so a Claude outage doesn't break Gemini-routed features.
### Compaction control
Long conversations trigger automatic compaction (Anthropic SDK + QRY's overrides). Defaults:
- Layer 0 (truncation) — `_truncate_tool_result_for_history` trims large tool outputs to metadata + 5 sample rows. Default budget: 8 KB.
- Layer 1 (server-side) — Anthropic's `clear_tool_uses_20250919` drops old tool_use/tool_result pairs. Default threshold: 120K tokens.
- Layer 2 (client-side) — the SDK's `CompactionControl` summarises the conversation in one extra API call. Default threshold: 150K tokens. Only fires when Layer 1 isn't enough.
Configurable in `runtime_config.ContextOverflowConfig`. Most tenants don't need to touch the defaults.
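The Layer 0 behaviour (trim a large tool output down to metadata plus 5 sample rows when it exceeds the byte budget) could look roughly like this. The function name and result shape here are hypothetical, chosen to mirror the description, not `_truncate_tool_result_for_history` itself:

```python
import json

def truncate_tool_result(result: dict, max_bytes: int = 8 * 1024,
                         sample_rows: int = 5) -> dict:
    """Layer-0-style truncation sketch: if the serialised tool result
    exceeds the byte budget, keep metadata plus a few sample rows."""
    raw = json.dumps(result).encode("utf-8")
    if len(raw) <= max_bytes:
        return result  # small enough, keep verbatim
    rows = result.get("rows", [])
    return {
        "truncated": True,
        "total_rows": len(rows),
        "sample_rows": rows[:sample_rows],
        "metadata": {k: v for k, v in result.items() if k != "rows"},
    }
```

The point of keeping `total_rows` and a sample is that the model can still reason about the shape of the data ("1000 rows, columns look like this") without carrying the full payload through every subsequent turn.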
## Cost controls
Cost overruns happen. Two controls:
- Per-user / per-group quotas — daily and rolling-30-day spend caps. See Task Quota Templates.
- Per-feature model budget — cap which models are usable for which feature. "Forge translations only on Sonnet, no Opus".
For tenants where a single user could plausibly burn $1000 of inference in a day, set the per-user cap aggressively at first and relax based on observed usage.
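The per-user quota check combines a daily cap with a rolling-30-day cap. A minimal sketch of that logic, assuming a simple spend ledger keyed by user and day (QRY's actual storage is not shown here):

```python
from datetime import date, timedelta

def spend_allowed(ledger: dict, user: str, today: date,
                  daily_cap: float, rolling_cap: float, cost: float) -> bool:
    """Check a proposed inference cost against daily and rolling-30-day caps.

    `ledger` maps (user, date) -> dollars already spent that day.
    """
    day_spend = ledger.get((user, today), 0.0)
    window = [today - timedelta(days=d) for d in range(30)]
    rolling_spend = sum(ledger.get((user, d), 0.0) for d in window)
    return day_spend + cost <= daily_cap and rolling_spend + cost <= rolling_cap
```

The rolling window is what catches the user who stays just under the daily cap every day: with a $50/day cap alone they could spend $1500/month, so the 30-day cap is the one that actually bounds monthly exposure.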
## Common issues
429 errors despite low traffic. The provider's per-org cap is shared across all tenants on the same Anthropic / Gemini / OpenAI account. Raise the cap with the provider, or split tenants across multiple billing accounts.
Compaction fires too often (extra cost). Raise the Layer 1 / Layer 2 thresholds, or cut the conversation history more aggressively at Layer 0 (smaller truncation budget).
A user reports "weird" answers after model switch. Switching providers mid-conversation can produce drift because the new model lacks the previous context's reasoning style. Restart the conversation if behaviour matters.
Vertex AI Gemini key works locally but not from the cluster. Check service-account permissions on the GCP project: the cluster pod's identity needs the aiplatform.user role on the project hosting Gemini.
Restart Celery workers after an embedding config change. The embedding service caches its config — see Embedding configuration. A pure chat-provider switch doesn't usually need a restart, but changing the embedding service or model does.
## See also
- Embedding configuration — embedding model is separate from chat models.
- License management — feature flags / capability flags from the GCP license affect what providers users can use.
- Claude Context Overflow Prevention reference — Layer 0/1/2 compaction details.