
LLM providers

QRY supports multiple LLM providers in parallel. Each conversation, notebook cell, dashboard tile, scheduled task, and Forge translation can target a different provider and model. This page covers the admin side: registering providers, mapping models to providers, and operational concerns.

Supported providers

  • Anthropic Claude — recommended default for chat and reasoning.
  • Google Gemini — strong multimodal support; Gemini-native Python executor for fast / cheap code execution.
  • OpenAI — GPT-4o, GPT-4 Turbo.

You don't have to pick one — register all three and let users (or per-feature config) decide.

Configuring a provider

In Admin > System Settings > LLM Providers, click + Add Provider for each.

Anthropic Claude

  • API key — from the Anthropic console.
  • Available models — claude-haiku-*, claude-sonnet-*, claude-opus-*. Tick the ones you want exposed to users.
  • Default rate limit — Anthropic's per-org cap; QRY respects it via exponential backoff on 429s.

Google Gemini

  • API key OR service-account JSON — service account is what Workspace tenants typically have.
  • Available models — gemini-flash-*, gemini-pro-*. Same tick-list pattern.
  • Project + location — for Vertex AI deployments instead of the public Gemini API.
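For reference, the two Gemini auth modes differ at the client level. A minimal sketch using the public google-generativeai and vertexai Python packages (project, location, and model names are placeholders; QRY's own wiring may differ):

```python
# Mode 1: public Gemini API, plain API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
flash = genai.GenerativeModel("gemini-1.5-flash")

# Mode 2: Vertex AI deployment. Credentials come from the service-account
# JSON (GOOGLE_APPLICATION_CREDENTIALS) or the pod's workload identity;
# you supply project + location instead of an API key.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")
pro = GenerativeModel("gemini-1.5-pro")
```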

OpenAI

  • API key — from the OpenAI dashboard.
  • Organization id — optional, used for billing attribution.
  • Available models — gpt-4o, gpt-4-turbo, etc.

Click Save. Each provider is tested with a tiny inference call to catch credential / connectivity issues immediately.
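The test amounts to a one-token request per provider. A hedged illustration with the official Python SDKs (this is not QRY's internal code, and the model names are only examples):

```python
import anthropic
import openai

def probe_anthropic(api_key: str) -> None:
    # A one-token request surfaces bad credentials or connectivity
    # problems without meaningful inference cost.
    anthropic.Anthropic(api_key=api_key).messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1,
        messages=[{"role": "user", "content": "ping"}],
    )

def probe_openai(api_key: str, organization: str | None = None) -> None:
    openai.OpenAI(api_key=api_key, organization=organization).chat.completions.create(
        model="gpt-4o",
        max_tokens=1,
        messages=[{"role": "user", "content": "ping"}],
    )
```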

Per-feature routing

Different features have different requirements; set per-feature defaults under Admin > System Settings > LLM Providers > Routing. Recommended defaults per feature:

  • Chat — Claude Sonnet
  • Notebook cells — Claude Haiku for routine cells, Sonnet for synthesis (per-cell choice)
  • Dashboard AI assistant — Gemini Pro (multimodal helps with screenshot prompts)
  • Forge translation — Claude Sonnet (best dialect translation)
  • Domain agents — Claude (native tool support)
  • ML training planning — any model
  • Scheduled task summary — cheapest available model

Each row in the routing config has five fields: feature, primary provider, primary model, fallback provider, and fallback model. The fallback fires when the primary returns a 5xx or rate-limit error.
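A minimal sketch of that shape — `Route`, `call_provider`, and the exception names are illustrative, not QRY's internal API:

```python
from dataclasses import dataclass

class RateLimitError(Exception): ...
class ServerError(Exception): ...

@dataclass
class Route:
    feature: str
    primary_provider: str
    primary_model: str
    fallback_provider: str
    fallback_model: str

# One row per feature, mirroring the five fields described above.
ROUTES = {
    "forge_translation": Route(
        "forge_translation", "anthropic", "claude-sonnet", "openai", "gpt-4o"
    ),
}

def call_provider(provider: str, model: str, prompt: str) -> str:
    raise NotImplementedError  # dispatch to the provider SDK in a real system

def complete(feature: str, prompt: str) -> str:
    route = ROUTES[feature]
    try:
        return call_provider(route.primary_provider, route.primary_model, prompt)
    except (RateLimitError, ServerError):
        # Fallback fires only on rate-limit and 5xx errors, per the rule above.
        return call_provider(route.fallback_provider, route.fallback_model, prompt)
```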

Resilience

QRY handles provider outages with three mechanisms:

Exponential backoff on 429 / 503

The LLM service retries with exponential backoff and jitter. Anthropic in particular sometimes returns 529 (overloaded); those go through the same retry path.
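In sketch form (the attempt count, base delay, and cap here are illustrative, not QRY's actual values):

```python
import random
import time

RETRYABLE = {429, 503, 529}  # rate limit, unavailable, Anthropic "overloaded"

class TransientHTTPError(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def with_backoff(call, max_attempts: int = 5, base: float = 1.0, cap: float = 30.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientHTTPError as err:
            if err.status not in RETRYABLE or attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential bound.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```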

Provider fallback

If the primary provider is down, the configured fallback takes over. Fallback is configured per feature, so a Claude outage doesn't break Gemini-routed features.

Compaction control

Long conversations trigger automatic compaction (via the Anthropic SDK plus QRY's overrides). Defaults:

  • Layer 0 (truncation) — _truncate_tool_result_for_history trims large tool outputs to metadata + 5 sample rows. Default budget: 8KB.
  • Layer 1 (server-side) — Anthropic's clear_tool_uses_20250919 drops old tool_use/tool_result pairs. Default 120K tokens.
  • Layer 2 (client-side) — SDK CompactionControl summarises the conversation in one extra API call. Default 150K tokens. Only fires when Layer 1 isn't enough.

All three layers are configurable via runtime_config.ContextOverflowConfig. Most tenants don't need to touch the defaults.
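As an illustration of the Layer 0 shape only (the real `_truncate_tool_result_for_history` is internal to QRY; this hypothetical sketch just shows the metadata-plus-sample-rows pattern under a byte budget):

```python
import json

LAYER0_BUDGET_BYTES = 8 * 1024  # default 8KB truncation budget
SAMPLE_ROWS = 5

def truncate_tool_result(rows: list[dict]) -> str:
    """Replace a large tool output with metadata plus a few sample rows."""
    summary = {
        "row_count": len(rows),
        "columns": sorted(rows[0]) if rows else [],
        "sample_rows": rows[:SAMPLE_ROWS],
        "truncated": len(rows) > SAMPLE_ROWS,
    }
    payload = json.dumps(summary, default=str)
    # Crude hard cap for the sketch; real code would drop sample rows until
    # the summary fits rather than cutting mid-JSON.
    return payload if len(payload) <= LAYER0_BUDGET_BYTES else payload[:LAYER0_BUDGET_BYTES]
```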

Cost controls

Cost overruns happen. Two controls:

  • Per-user / per-group quotas — daily and rolling-30-day spend caps. See Task Quota Templates.
  • Per-feature model budget — restrict which models each feature can use, e.g. "Forge translations run only on Sonnet, never Opus".

For tenants where a single user could plausibly burn $1000 of inference in a day, set the per-user cap aggressively at first and relax it based on observed usage.
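The quota check itself is simple in outline. A hypothetical sketch (`spend_last_24h` and `QuotaExceeded` are stand-ins, not QRY identifiers):

```python
class QuotaExceeded(Exception): ...

def spend_last_24h(user_id: str) -> float:
    raise NotImplementedError  # sum of logged inference costs for this user

def check_quota(user_id: str, estimated_cost_usd: float, daily_cap_usd: float) -> None:
    # Reject the request before dispatch if it would blow the daily cap.
    spent = spend_last_24h(user_id)
    if spent + estimated_cost_usd > daily_cap_usd:
        raise QuotaExceeded(
            f"user {user_id}: ${spent:.2f} of ${daily_cap_usd:.2f} daily cap already spent"
        )
```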

Common issues

429 errors despite low traffic. The provider's per-org cap is shared across every tenant on the same Anthropic / Gemini / OpenAI account. Raise the cap with the provider, or split tenants across multiple billing accounts.

Compaction fires too often (extra cost). Raise the Layer 1 / Layer 2 thresholds, or trim tool outputs in the history more aggressively at Layer 0 (a smaller truncation budget).

A user reports "weird" answers after a model switch. Switching providers mid-conversation can produce drift because the new model doesn't share the previous model's reasoning style. Restart the conversation if consistent behaviour matters.

Vertex AI Gemini key works locally but not from the cluster. This is almost always service-account permissions on the GCP project: the cluster pod's identity needs the aiplatform.user role on the project hosting Gemini.

Restart Celery workers after an embedding config change. The embedding service caches its config — see Embedding configuration. Pure provider switches don't usually need a restart, but a service or model change for embeddings does.

See also

  • Task Quota Templates
  • Embedding configuration
