Automated Data Profiling

QRY automatically analyzes your database tables to provide rich context and smarter query suggestions, eliminating the need to manually explore data structure and statistics.

Overview

Data profiling is the automated statistical analysis and examination of data to understand its structure, content, and quality. QRY's profiling system runs in the background, enriching conversations with valuable insights about your tables before you even ask.

What Gets Profiled

Table Metrics: Row counts, size estimates, last update timestamps
Column Analysis: Data types, null percentages, unique value counts
Statistical Summaries: Min/max/avg for numeric columns, value distributions
Data Quality: Completeness scores, potential data issues
Relationships: Foreign key detection and cardinality analysis

How It Works

Automatic Background Profiling

Profiling runs on a configurable schedule without impacting database performance:

Scheduled Jobs: Kubernetes CronJobs execute profiling at off-peak hours
Sampling Strategy: Analyzes representative samples for large tables
Incremental Updates: Only re-profiles tables that have changed
Resource-Aware: Respects database load and query timeouts

Intelligent Context Injection

When you ask a question, profiling data automatically enhances the AI's understanding:

User: "Show me our top customers"

Behind the scenes, AI receives:
- customers table has 1.2M rows
- revenue column ranges from $0 to $500K
- top_customer flag exists (boolean, 5% true)
- last_purchase_date shows 80% active in last 90 days

This context enables smarter, more relevant responses.

Key Benefits

For Business Users

Faster Query Understanding

AI suggests filters based on actual data distributions
Automatically handles edge cases (nulls, outliers)
Provides context about data freshness

Example:

You: "Show me sales by region"

Without Profiling: Generic query, may include test data
With Profiling: AI knows region column has 5 values, excludes NULL regions,
notes 80% of sales are in "North America"

For Data Analysts

Data Quality Insights

Identify tables with high null percentages
Spot potential data quality issues
Understand value distributions before querying

Optimized Query Suggestions

AI recommends appropriate aggregations
Suggests useful filters based on cardinality
Warns about potential performance issues

For Administrators

Automated Discovery

New tables automatically profiled
No manual configuration required for basic profiling
Extensible for custom metrics

Performance Optimization

Identify large tables needing optimization
Monitor data growth trends
Plan database resources proactively

Profiling Configuration

Admin Settings

Administrators can configure profiling per database:

Basic Configuration

Database: production_db
Profiling Enabled: Yes
Schedule: Daily at 2:00 AM UTC
Sample Size: 10,000 rows (for tables > 1M rows)
Timeout: 30 seconds per table

Advanced Options

Include/Exclude Patterns: Target specific schemas or tables
```
Include: public.*, analytics.*
Exclude: temp_*, staging.*
```
Column-Level Control: Skip profiling for sensitive columns
```
Exclude Columns: ssn, credit_card, password_hash
```

Performance Tuning: Adjust sampling and concurrency

Max Concurrent Tables: 5
Sample Percentage: 1% (for very large tables)

Manual Triggers

Run profiling on-demand for specific tables:

Navigate to Admin → Data Profiling
Select database and tables
Click "Profile Now"
Monitor progress in real-time

Profiling Results

Statistics Dashboard

View profiling results in the admin panel:

Table-Level Metrics

Total row count and estimated size
Number of columns and data types
Last profiled timestamp
Data freshness indicators

Column-Level Details

Data type and nullable status
Unique value count (cardinality)
Null percentage
Min/max/avg for numeric columns
Most common values for categorical columns

Example Profiling Output

Table: customers
Rows: 1,247,893
Columns: 15
Last Updated: 2025-10-18 06:15:00
Profiled: 2025-10-18 02:00:00

Column: customer_id (INTEGER, PRIMARY KEY)
- Unique Values: 1,247,893 (100%)
- Nulls: 0%
- Range: 1 to 1247893

Column: signup_date (DATE)
- Unique Values: 1,456
- Nulls: 0.1%
- Range: 2020-01-01 to 2025-10-18
- Distribution: 65% in last 2 years

Column: customer_tier (VARCHAR)
- Unique Values: 4
- Nulls: 2%
- Values: Bronze (45%), Silver (35%), Gold (15%), Platinum (5%)

Column: lifetime_value (DECIMAL)
- Unique Values: 89,234
- Nulls: 0%
- Range: $0.00 to $487,231.15
- Average: $8,452.33
- Median: $3,200.00

Integration with AI

Smarter Query Generation

Profiling data helps the AI make better decisions:

Appropriate Aggregations

Low cardinality (4 values) → GROUP BY suggested
High cardinality (1M values) → TOP N suggested

Filter Recommendations

Column with 90% nulls → Add IS NOT NULL filter
Date column → Suggest recent date range

Performance Warnings

Table with 100M rows → Recommend limiting results
Complex join on high-cardinality columns → Suggest indexes

Enhanced Explanations

AI can explain results with data context:

User: "Why are there so few platinum customers?"

AI: "Based on the profiling data, only 5% of customers (about 62,000) reach
Platinum tier. This appears intentional as the tier requires lifetime value
above $50,000, and the average customer lifetime value is $8,452."

Data Quality Detection

Automatic Issue Identification

Profiling helps spot common data quality problems:

Missing Data

Columns with unexpected high null percentages
Patterns in missing data (e.g., nulls only for recent dates)

Inconsistent Values

Spelling variations (e.g., "New York", "NY", "new york")
Out-of-range values for constrained fields
Unexpected data types or formats

Data Drift

Significant changes in value distributions
Sudden spikes or drops in row counts
New categories appearing in categorical columns

Quality Alerts

Configure alerts for data quality issues:

Alert: High null percentage
Trigger: Column nulls > 20%
Action: Notify data team

Alert: Row count change
Trigger: Daily row count change > 50%
Action: Create ticket

Alert: New categorical values
Trigger: New values in tier, status, category columns
Action: Log for review

Best Practices

Scheduling

Off-Peak Hours: Run profiling when database load is low
Frequency: Daily for active tables, weekly for historical tables
Incremental: Profile changed tables more frequently

Sampling

Small Tables (under 100K rows): Profile all rows for accuracy
Medium Tables (100K-10M): Sample 10-100K rows
Large Tables (over 10M): Sample 0.1-1% or fixed 100K rows

Performance

Parallel Execution: Profile multiple tables concurrently
Timeout Limits: Set reasonable timeouts to prevent hanging
Resource Limits: Use read replicas if available
Index Utilization: Ensure profiling queries use indexes

Data Privacy

Sensitive Columns: Exclude PII from profiling
Value Samples: Avoid storing actual sensitive values
Access Control: Restrict profiling results to authorized users
Audit Logging: Track who views profiling data

Use Cases

Sales Analysis

Scenario: Analyzing monthly sales trends

Without Profiling:

User asks about sales
AI doesn't know typical ranges
Returns query with potential outliers

With Profiling:

AI knows sales range from $100-$1M
Identifies $10M sale as outlier
Suggests investigating or excluding it
Notes 20% of sales are from repeat customers

Customer Segmentation

Scenario: Understanding customer base

Profiling Reveals:

45% of customers have 1 purchase only
Average customer lifetime: 2.3 years
Top 10% of customers drive 60% of revenue
Peak signup months: January, September

AI Uses This To:

Suggest cohort analysis
Recommend retention strategies
Identify high-value segments
Predict churn patterns

Data Migration

Scenario: Moving to new database

Profiling Helps:

Validate row counts match
Verify value distributions unchanged
Identify data type mismatches
Confirm relationship integrity

Technical Details

Storage

Profiling results are stored in PostgreSQL tables:

table_profiling_results
- database_id, table_name
- row_count, size_bytes
- profiled_at, metadata

column_profiling_results
- table_id, column_name
- data_type, nullable
- unique_count, null_percentage
- statistics (JSON: min, max, avg, etc.)

API Access

Retrieve profiling data programmatically:

GET /api/profiling/databases/{db_id}/tables/{table_name}

Response:
{
  "table_name": "customers",
  "row_count": 1247893,
  "columns": [
    {
      "name": "customer_id",
      "type": "INTEGER",
      "unique_count": 1247893,
      "null_percentage": 0.0,
      "statistics": {...}
    },
    ...
  ]
}

FAQ

Q: Does profiling slow down my database? A: No, profiling uses sampling and runs during off-peak hours with strict timeouts.

Q: Can I disable profiling for specific tables? A: Yes, use exclude patterns in the profiling configuration.

Q: How often should I run profiling? A: Daily for active tables, weekly for historical/archive tables.

Q: Is profiling data included in backups? A: Yes, profiling results are part of the QRY database and included in backups.

Q: Can I see historical profiling data? A: Currently only the latest profiling run is stored; historical tracking is on the roadmap.

Next Steps

Explore Workspaces for team collaboration with shared knowledge
Explore Database Connectivity for connecting new data sources
See QRY Nexus — profiling data feeds into Data Product quality scoring
Read the Admin Guide for advanced configuration

Automated data profiling ensures QRY has the context needed to provide intelligent, accurate insights about your data.

Overview​

What Gets Profiled​

How It Works​

Automatic Background Profiling​

Intelligent Context Injection​

Key Benefits​

For Business Users​

For Data Analysts​

For Administrators​

Profiling Configuration​

Admin Settings​

Basic Configuration​

Advanced Options​

Manual Triggers​

Profiling Results​

Statistics Dashboard​

Example Profiling Output​

Integration with AI​

Smarter Query Generation​

Enhanced Explanations​

Data Quality Detection​

Automatic Issue Identification​

Quality Alerts​

Best Practices​

Scheduling​

Sampling​

Performance​

Data Privacy​

Use Cases​

Sales Analysis​

Customer Segmentation​

Data Migration​

Technical Details​

Storage​

API Access​

FAQ​

Next Steps​