ML Hub

Status: Production Ready

Overview

Remember when building ML pipelines meant months of setup, wrestling with MLflow configs, explaining to your manager why "the model works locally," and begging DevOps for GPU instances? Those days are over.

ML Hub is Qry's in-database machine learning platform that lets you train, experiment with, and monitor ML models entirely through conversation. No Jupyter notebooks to maintain, no Docker containers to debug, no "let me just set up the infrastructure first." Describe what you want, train a model, make predictions. Done.

Key Features

Conversational ML Training

Tell Qry what you want to predict, and it handles the rest. "Train a model to predict customer churn" becomes a deployed XGBoost classifier in minutes. Your data science team might get nervous about job security (they shouldn't - they'll just have time for interesting problems now).

Experiment Tracking

Compare algorithms and hyperparameters without drowning in spreadsheets. Run XGBoost vs LightGBM vs Random Forest, see results side-by-side, promote the winner. It's like A/B testing, but for models, and without the three-week debate about statistical significance.

Feature Store

Stop copy-pasting feature definitions across notebooks. Define customer_lifetime_value once, use it everywhere. Your future self will send you a thank-you card.

Model Monitoring

Know when your production model starts predicting nonsense before your users do. Automatic drift detection, performance tracking, two-tier alerts. It's like having a quality assurance team that never sleeps.

Four Core Components

| Component | Purpose | Think of it as... |
| --- | --- | --- |
| Model Registry | Train, version, and deploy models | Your model library |
| Experiment Tracking | Compare approaches scientifically | A/B testing for ML |
| Feature Store | Reusable feature definitions | Your feature encyclopedia |
| Model Monitoring | Track health and drift | A 24/7 model babysitter |

Getting Started

  1. Open Qry
  2. Click ML Hub in the left navigation rail
  3. Behold: Models, Experiments, Features, and Monitoring tabs

Training Your First Model

Just ask:

Train an XGBoost model to predict customer churn using the
customer_data table. Use age, tenure, and monthly_charges
as features and churned as the target.

That's it. Qry will:

  1. Create a training job
  2. Handle train/test splits
  3. Train the model
  4. Calculate metrics (accuracy, F1, AUC)
  5. Store it in the registry

No imports, no configs, no debugging why sklearn version 1.3.2 broke everything.
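
If you're curious what steps 2 and 4 amount to, here is a minimal plain-Python sketch of a train/test split plus accuracy and F1. The docs don't describe how Qry implements these steps internally; the helper names, the 80/20 split, and the toy labels are all assumptions made for illustration. AUC is left out to keep the sketch short.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

train, test = train_test_split(range(100))
print(len(train), len(test))                # 80 20
print(accuracy(y_true, y_pred))             # 0.8
print(round(f1_score(y_true, y_pred), 3))   # 0.667
```

Note how accuracy (0.8) looks healthier than F1 (0.667) on the same predictions. That gap is why F1 is usually the better metric when the positive class is rare.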

Model Registry

Supported Algorithms

Classification (when you're predicting categories):

| Algorithm | Best for |
| --- | --- |
| xgboost_classifier | Maximum accuracy on tabular data |
| lightgbm_classifier | Fast training, handles large datasets gracefully |
| random_forest_classifier | When you need to explain the model to executives |
| logistic_regression | Simple baseline, interpretable coefficients |

Regression (when you're predicting numbers):

| Algorithm | Best for |
| --- | --- |
| xgboost_regressor | Tabular data supremacy |
| lightgbm_regressor | Large datasets, handles missing values |
| random_forest_regressor | Interpretability over maximum performance |
| linear_regression | Quick baseline, easy to explain |

Model Lifecycle

STAGING → PRODUCTION → ARCHIVED
   ↓           ↓
(test it) (one active)
  • Staging: Fresh from training, ready for evaluation
  • Production: Active and serving predictions (only one version at a time)
  • Archived: Retired but preserved for posterity and audits

Model Versioning

Same model name = automatic version increment:

churn-model:v1 → churn-model:v2 → churn-model:v3

Promoting a new version archives the previous one automatically. No manual cleanup, no version conflicts, no "which model is actually in production?" conversations.
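
To make the versioning and promotion rules concrete, here is a toy in-memory registry. It is only a sketch of the behavior described above (auto-incrementing versions, one production version at a time, automatic archiving on promotion); `ToyRegistry`, `ModelVersion`, and their methods are hypothetical names, not part of Qry.

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "STAGING"  # new versions start in staging

@dataclass
class ToyRegistry:
    # Maps model name -> list of ModelVersion, oldest first.
    versions: dict = field(default_factory=dict)

    def register(self, name):
        """Same model name = next version number."""
        history = self.versions.setdefault(name, [])
        model = ModelVersion(name, len(history) + 1)
        history.append(model)
        return model

    def promote(self, name, version):
        """Archive the current production version, then promote the target."""
        for model in self.versions[name]:
            if model.stage == "PRODUCTION":
                model.stage = "ARCHIVED"
        target = self.versions[name][version - 1]
        target.stage = "PRODUCTION"
        return target

registry = ToyRegistry()
registry.register("churn-model")   # v1
registry.register("churn-model")   # v2
registry.promote("churn-model", 1)
registry.promote("churn-model", 2)
stages = [m.stage for m in registry.versions["churn-model"]]
print(stages)  # ['ARCHIVED', 'PRODUCTION']
```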

Training Models

Basic Training

Train a random forest classifier called customer-segments
using the users table with features [age, income, spend] and
target segment

Training with Custom Hyperparameters

Train a random_forest_classifier called customer-segmentation
with n_estimators=200 and max_depth=15. Use the customer_data
table with features [age, income, purchase_history] and
target segment.

Training from SQL Query

When your training data needs a bit of prep:

Train an xgboost_regressor called revenue-predictor using:
SELECT customer_id, order_count, avg_order_value, days_active,
lifetime_value as target
FROM customer_metrics
WHERE lifetime_value IS NOT NULL

Checking Training Progress

What's the status of my training jobs?
Show logs for training job abc123...

Managing Models

List models:

List all my production models
Show all XGBoost models in staging

Get details:

Get details for the churn-prediction model
Show metrics for model version 2

Promote to production:

Promote churn-prediction v2 to production

Archive or delete:

Archive the old segmentation model
Delete the failed training attempt

Making Predictions

Single Prediction

Use the churn-prediction model to predict if this customer will churn:
{age: 35, tenure: 24, monthly_charges: 75.50, total_charges: 1812.00}

Response includes prediction probability, model version used, and inference latency. Everything you need for debugging and auditing.

Batch Predictions

Use the churn model to predict churn for all active customers.
Query: SELECT age, tenure, monthly_charges, total_charges
FROM customers WHERE status = 'active'

Scheduled Predictions

Because nobody wants to run predictions manually at 9 AM every day:

Schedule daily churn predictions at 9am using the churn model.
Query the active_customers table and send results to
analytics@company.com

Supported schedules:

  • Cron expressions: 0 9 * * *
  • Natural language: every day at 9am, every monday at 8pm
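
If cron syntax is unfamiliar, this small sketch labels the five fields of a standard cron expression. `describe_cron` is a hypothetical helper, and the claim that "every monday at 8pm" maps to `0 20 * * 1` reflects standard cron conventions; whether Qry translates it to exactly that string is an assumption.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def describe_cron(expression):
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))

# "every day at 9am": minute 0, hour 9, any day, any month, any weekday.
daily_9am = describe_cron("0 9 * * *")

# "every monday at 8pm": hour 20, weekday 1 (Monday in standard cron).
monday_8pm = describe_cron("0 20 * * 1")

print(daily_9am["hour"], monday_8pm["hour"], monday_8pm["day_of_week"])
```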

Experiment Tracking

Experiments let you compare algorithms without the chaos of 47 Jupyter notebooks named model_final_v2_REAL_final.ipynb.

Workflow

  1. Create an experiment with a hypothesis
  2. Add training runs
  3. Compare results
  4. Promote the winner to production

Creating an Experiment

Create an experiment called 'Churn Model Comparison' with
hypothesis: 'XGBoost will outperform Random Forest for our
customer data'. Target metric: f1

Quick Experiment (The Easy Way)

Let Qry do the heavy lifting:

Run an experiment comparing XGBoost, LightGBM, and Random Forest
for predicting churn. Use the customer_data table with features
[age, tenure, charges] and target churned.

This automatically:

  1. Creates the experiment
  2. Trains all three model types
  3. Compares results
  4. Identifies the best performer

You just asked a question and got a scientific comparison. Try explaining that to someone from 2010.
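
Step 4 boils down to picking the run with the highest target metric. A minimal sketch, assuming a higher-is-better metric such as F1; the `best_run` helper and the metric values are placeholders invented for illustration, not real results from any model.

```python
# Placeholder numbers only; these are not measured results.
runs = [
    {"model_type": "xgboost_classifier", "f1": 0.81, "accuracy": 0.88},
    {"model_type": "lightgbm_classifier", "f1": 0.79, "accuracy": 0.89},
    {"model_type": "random_forest_classifier", "f1": 0.76, "accuracy": 0.86},
]

def best_run(runs, target_metric):
    """Return the run with the highest value for a higher-is-better metric."""
    return max(runs, key=lambda run: run[target_metric])

winner_by_f1 = best_run(runs, "f1")
winner_by_accuracy = best_run(runs, "accuracy")
print(winner_by_f1["model_type"], winner_by_accuracy["model_type"])
```

With these placeholder numbers the winner changes depending on the metric, which is why the target metric is set when the experiment is created rather than after the results are in.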

Comparing Results

Compare all runs in my churn experiment

You get:

  • Side-by-side metrics (accuracy, F1, training time)
  • Best run highlighted
  • Feature importance from each model

Completing and Promoting

Complete the churn experiment and promote the best model to production

Or complete without promoting:

Complete the segmentation experiment

Feature Store

Stop redefining the same features in every notebook. Define once, use everywhere, maintain your sanity.

Core Concepts

| Concept | Description | Example |
| --- | --- | --- |
| Entity | Primary key for lookups | customer_id, product_id |
| Feature | Single computed value | customer_lifetime_value |
| Feature Set | Collection of features | churn_prediction_features |

Creating Features

Simple feature:

Create a feature called total_orders that counts customer orders

Detailed definition:

Create a feature named customer_lifetime_value:
- Entity: customer_id
- Type: numeric
- Expression: SUM(amount) FROM orders WHERE customer_id = entity.customer_id
- Source table: orders
- Tags: [customer, financial]
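
For intuition, the expression above amounts to a per-customer aggregation over the orders table. Here is a sketch using Python's built-in sqlite3 module with a tiny made-up table; the sample rows, the function name, and the choice to return 0 for customers with no orders are assumptions, and Qry's actual evaluation of `entity.customer_id` may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 30.0), (2, 15.5)],  # made-up sample rows
)

def customer_lifetime_value(conn, customer_id):
    """Sum order amounts for one entity (customer_id)."""
    row = conn.execute(
        # COALESCE turns the NULL that SUM returns for no rows into 0.
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]

print(customer_lifetime_value(conn, 1))  # 50.0
print(customer_lifetime_value(conn, 2))  # 15.5
print(customer_lifetime_value(conn, 3))  # 0
```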

Feature Types

| Type | Use for |
| --- | --- |
| numeric | Numbers, amounts, counts |
| categorical | Labels, categories, segments |
| boolean | Yes/no, true/false |
| text | Free-form text |
| datetime | Dates and times |
| array | Lists of values |

Feature Status

  • Draft: Work in progress
  • Active: Ready for use in training
  • Deprecated: Avoid in new models
  • Archived: Historical reference only

Feature Sets

Group features for consistent training:

Create a feature set called churn_prediction_features with:
- customer_lifetime_value
- customer_order_count
- days_since_last_order
- customer_total_spend

Use customer_id as the entity key

Now train with:

Train an XGBoost model using the churn_prediction_features
feature set with target churned

Consistent features across every model. No more "wait, did we include tenure in this one?"

Model Monitoring

Production models are like houseplants - they need regular attention or they die quietly.

Monitor Types

| Type | What it catches |
| --- | --- |
| performance | Accuracy degradation over time |
| data_drift | Input feature distributions shifting |
| prediction_drift | Output predictions changing unexpectedly |
| data_quality | Null rates, out-of-range values |

Creating Monitors

Performance monitor:

Create a performance monitor for my churn-production model
that alerts if accuracy drops below 0.90

Data drift monitor:

Set up a data drift monitor on customer-churn checking
every 2 hours for features [age, tenure, monthly_charges]
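
The docs don't say how ML Hub measures drift. One common technique is the Population Stability Index (PSI), swapped in here purely to illustrate what "input feature distributions shifting" means in numbers. The `psi` function, the bin count, and the sample ages are all assumptions.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a new sample.

    Bins are equal-width over the baseline's range; values outside
    that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    baseline, current = proportions(expected), proportions(actual)
    return sum((c - b) * math.log(c / b) for b, c in zip(baseline, current))

training_ages = list(range(20, 60))                 # made-up baseline
production_ages = [age + 15 for age in training_ages]  # shifted sample

print(psi(training_ages, training_ages))    # 0.0, identical distributions
print(psi(training_ages, production_ages) > 0)  # True, distribution shifted
```

A PSI of zero means the two samples fall into the bins in identical proportions; larger values indicate a bigger shift.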

With thresholds:

Create a performance monitor for revenue-predictor with
warning threshold at 0.85 and critical threshold at 0.80

Alert System

Two-tier alerts so you know what's urgent:

  • Warning: "Hey, might want to look at this"
  • Critical: "Wake up, something's wrong"

Alert lifecycle: OPEN → ACKNOWLEDGED → RESOLVED
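
Here is a small sketch of how two-tier thresholds and the alert lifecycle (open, then acknowledged, then resolved) could fit together. It assumes a higher-is-better metric such as accuracy and reuses the 0.85 and 0.80 thresholds from the example above. The function names are hypothetical, and allowing an alert to go straight from OPEN to RESOLVED is an assumption, not documented behavior.

```python
# Which status changes are allowed from each state (assumed rules).
ALLOWED_TRANSITIONS = {
    "OPEN": {"ACKNOWLEDGED", "RESOLVED"},
    "ACKNOWLEDGED": {"RESOLVED"},
    "RESOLVED": set(),
}

def severity(metric_value, warning, critical):
    """Return 'critical', 'warning', or None for a higher-is-better metric."""
    if metric_value < critical:
        return "critical"
    if metric_value < warning:
        return "warning"
    return None

def transition(current, new):
    """Move an alert to a new status, rejecting invalid jumps."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move alert from {current} to {new}")
    return new

print(severity(0.90, warning=0.85, critical=0.80))  # None
print(severity(0.83, warning=0.85, critical=0.80))  # warning
print(severity(0.78, warning=0.85, critical=0.80))  # critical

status = transition("OPEN", "ACKNOWLEDGED")
status = transition(status, "RESOLVED")
print(status)  # RESOLVED
```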

Managing Alerts

List open alerts for my models
Show critical alerts from the past 24 hours
Acknowledge alert xyz... - we're investigating
Resolve alert abc... - model was retrained

ML Hub UI

Navigate to ML Hub in Qry for visual management.

Models Tab

  • View all models with status badges (staging, production, archived)
  • Search and filter by status/type
  • Bulk select for delete/archive operations
  • Click any model for metrics, versions, and deployment history

Experiments Tab

  • Grid view of all experiments
  • Status indicators (running, completed, abandoned)
  • Run count and best result preview
  • Click to view detailed run comparison charts

Features Tab

  • Toggle between Features and Feature Sets
  • View feature definitions and usage statistics
  • Check feature status and entity columns
  • Click for full SQL expression and lineage

Monitoring Tab

  • Dashboard with health summary cards
  • Open alerts with Acknowledge/Resolve buttons
  • Monitors table with current status
  • Last check time and health indicators

Best Practices

Model Training

  1. Start with XGBoost or LightGBM - they handle tabular data exceptionally well
  2. Use at least 1,000 rows - fewer than that and you're training on noise
  3. Run experiments first - compare before promoting to production
  4. Document model purpose - future you will appreciate it

Feature Store

  1. Name descriptively - customer_lifetime_value beats f1
  2. Use consistent entity columns - standardize on customer_id, not cust_id or customerID
  3. Add tags generously - makes discovery much easier
  4. Monitor feature statistics - catch data quality issues early

Experiments

  1. Set clear hypotheses - "XGBoost will outperform" not "let's try stuff"
  2. Choose appropriate metrics - F1 for imbalanced classes, RMSE for regression
  3. Test 3-5 configurations - enough to compare, not so many you can't analyze
  4. Only promote when exceeding baseline - don't deploy for the sake of deploying

Monitoring

  1. Create monitors immediately after production promotion - not next week
  2. Set thresholds based on business impact - what degradation level actually matters?
  3. Monitor data quality AND performance - bad data causes bad predictions
  4. Act on alerts quickly - they're not decorations

Quick Reference

Training

Train [model_type] called [name] from [table] with features [columns] target [column]
Check status of my training jobs
Get details for [model_name]

Predictions

Predict with [model_name] for {feature: value, ...}
Batch predict [model_name] from [query]
Schedule predictions for [model_name] [schedule] to [emails]

Experiments

Create experiment [name] with hypothesis [text]
Add [model_type] run to [experiment] with [parameters]
Compare runs in [experiment]
Complete [experiment] and promote best

Features

Create feature [name] that [description]
Create feature set [name] with features [list]
List active features
Fetch [feature_set] for entities [list]

Monitoring

Create [type] monitor for [model_name]
Get health summary for [model_name]
List open alerts
Resolve alert [id] - [notes]

Troubleshooting

Training Job Failed

Check logs:

Show logs for training job xyz...

Common causes:

| Issue | Solution |
| --- | --- |
| Query returns no data | Verify your SQL, check table names |
| Missing feature columns | Double-check column names exist |
| Data too large | Reduce dataset size or increase timeout |
| Invalid hyperparameters | Start with defaults, then customize |

Predictions Are Inaccurate

  1. Check for data drift: Get health summary for [model]
  2. Verify input features match training distribution
  3. Review training vs production metrics
  4. Consider retraining with recent data

Feature Computation Is Slow

  • Materialize frequently-used feature sets
  • Index entity columns in source tables
  • Simplify complex SQL expressions
  • Use caching for repeated lookups
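
Two of these tips, indexing the entity column and caching repeated lookups, can be sketched with Python's standard library. This is a generic illustration using sqlite3 and functools.lru_cache, not a description of how Qry stores or caches features; the table, index name, and `total_spend` function are made up.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 30.0), (2, 15.5)],  # made-up sample rows
)
# Index the entity column so per-customer lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

@lru_cache(maxsize=1024)
def total_spend(customer_id):
    """Compute the feature once per customer; repeat calls hit the cache."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]

total_spend(1)  # runs the query
total_spend(1)  # served from the cache
info = total_spend.cache_info()
print(info.hits, info.misses)  # 1 1
```

One caveat with this approach: cached values go stale when the underlying rows change, so clear the cache (`total_spend.cache_clear()`) after the source data is refreshed.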

Monitor Not Triggering

  • Verify the monitor is in ACTIVE status
  • Check threshold values are realistic
  • Confirm check interval is appropriate
  • Look for errors in monitor logs

Last updated: December 2025