Skip to main content

Automated Data Profiling

QRY automatically analyzes your database tables to provide rich context and smarter query suggestions, eliminating the need to manually explore data structure and statistics.

Overview

Data profiling is the automated statistical analysis and examination of data to understand its structure, content, and quality. QRY's profiling system runs in the background, enriching conversations with valuable insights about your tables before you even ask.

What Gets Profiled

  • Table Metrics: Row counts, size estimates, last update timestamps
  • Column Analysis: Data types, null percentages, unique value counts
  • Statistical Summaries: Min/max/avg for numeric columns, value distributions
  • Data Quality: Completeness scores, potential data issues
  • Relationships: Foreign key detection and cardinality analysis

How It Works

Automatic Background Profiling

Profiling runs on a configurable schedule without impacting database performance:

  1. Scheduled Jobs: Kubernetes CronJobs execute profiling at off-peak hours
  2. Sampling Strategy: Analyzes representative samples for large tables
  3. Incremental Updates: Only re-profiles tables that have changed
  4. Resource-Aware: Respects database load and query timeouts

Intelligent Context Injection

When you ask a question, profiling data automatically enhances the AI's understanding:

User: "Show me our top customers"

Behind the scenes, AI receives:
- customers table has 1.2M rows
- revenue column ranges from $0 to $500K
- top_customer flag exists (boolean, 5% true)
- last_purchase_date shows 80% active in last 90 days

This context enables smarter, more relevant responses.

Key Benefits

For Business Users

Faster Query Understanding

  • AI suggests filters based on actual data distributions
  • Automatically handles edge cases (nulls, outliers)
  • Provides context about data freshness

Example:

You: "Show me sales by region"

Without Profiling: Generic query, may include test data
With Profiling: AI knows region column has 5 values, excludes NULL regions,
notes 80% of sales are in "North America"

For Data Analysts

Data Quality Insights

  • Identify tables with high null percentages
  • Spot potential data quality issues
  • Understand value distributions before querying

Optimized Query Suggestions

  • AI recommends appropriate aggregations
  • Suggests useful filters based on cardinality
  • Warns about potential performance issues

For Administrators

Automated Discovery

  • New tables automatically profiled
  • No manual configuration required for basic profiling
  • Extensible for custom metrics

Performance Optimization

  • Identify large tables needing optimization
  • Monitor data growth trends
  • Plan database resources proactively

Profiling Configuration

Admin Settings

Administrators can configure profiling per database:

Basic Configuration

Database: production_db
Profiling Enabled: Yes
Schedule: Daily at 2:00 AM UTC
Sample Size: 10,000 rows (for tables > 1M rows)
Timeout: 30 seconds per table

Advanced Options

  • Include/Exclude Patterns: Target specific schemas or tables

    Include: public.*, analytics.*
    Exclude: temp_*, staging.*
  • Column-Level Control: Skip profiling for sensitive columns

    Exclude Columns: ssn, credit_card, password_hash
  • Performance Tuning: Adjust sampling and concurrency

    Max Concurrent Tables: 5
    Sample Percentage: 1% (for very large tables)

Manual Triggers

Run profiling on-demand for specific tables:

  1. Navigate to Admin → Data Profiling
  2. Select database and tables
  3. Click "Profile Now"
  4. Monitor progress in real-time

Profiling Results

Statistics Dashboard

View profiling results in the admin panel:

Table-Level Metrics

  • Total row count and estimated size
  • Number of columns and data types
  • Last profiled timestamp
  • Data freshness indicators

Column-Level Details

  • Data type and nullable status
  • Unique value count (cardinality)
  • Null percentage
  • Min/max/avg for numeric columns
  • Most common values for categorical columns

Example Profiling Output

Table: customers
Rows: 1,247,893
Columns: 15
Last Updated: 2025-10-18 06:15:00
Profiled: 2025-10-18 02:00:00

Column: customer_id (INTEGER, PRIMARY KEY)
- Unique Values: 1,247,893 (100%)
- Nulls: 0%
- Range: 1 to 1247893

Column: signup_date (DATE)
- Unique Values: 1,456
- Nulls: 0.1%
- Range: 2020-01-01 to 2025-10-18
- Distribution: 65% in last 2 years

Column: customer_tier (VARCHAR)
- Unique Values: 4
- Nulls: 2%
- Values: Bronze (45%), Silver (35%), Gold (15%), Platinum (5%)

Column: lifetime_value (DECIMAL)
- Unique Values: 89,234
- Nulls: 0%
- Range: $0.00 to $487,231.15
- Average: $8,452.33
- Median: $3,200.00

Integration with AI

Smarter Query Generation

Profiling data helps the AI make better decisions:

Appropriate Aggregations

Low cardinality (4 values) → GROUP BY suggested
High cardinality (1M values) → TOP N suggested

Filter Recommendations

Column with 90% nulls → Add IS NOT NULL filter
Date column → Suggest recent date range

Performance Warnings

Table with 100M rows → Recommend limiting results
Complex join on high-cardinality columns → Suggest indexes

Enhanced Explanations

AI can explain results with data context:

User: "Why are there so few platinum customers?"

AI: "Based on the profiling data, only 5% of customers (about 62,000) reach
Platinum tier. This appears intentional as the tier requires lifetime value
above $50,000, and the average customer lifetime value is $8,452."

Data Quality Detection

Automatic Issue Identification

Profiling helps spot common data quality problems:

Missing Data

  • Columns with unexpected high null percentages
  • Patterns in missing data (e.g., nulls only for recent dates)

Inconsistent Values

  • Spelling variations (e.g., "New York", "NY", "new york")
  • Out-of-range values for constrained fields
  • Unexpected data types or formats

Data Drift

  • Significant changes in value distributions
  • Sudden spikes or drops in row counts
  • New categories appearing in categorical columns

Quality Alerts

Configure alerts for data quality issues:

Alert: High null percentage
Trigger: Column nulls > 20%
Action: Notify data team

Alert: Row count change
Trigger: Daily row count change > 50%
Action: Create ticket

Alert: New categorical values
Trigger: New values in tier, status, category columns
Action: Log for review

Best Practices

Scheduling

  • Off-Peak Hours: Run profiling when database load is low
  • Frequency: Daily for active tables, weekly for historical tables
  • Incremental: Profile changed tables more frequently

Sampling

  • Small Tables (under 100K rows): Profile all rows for accuracy
  • Medium Tables (100K-10M): Sample 10-100K rows
  • Large Tables (over 10M): Sample 0.1-1% or fixed 100K rows

Performance

  • Parallel Execution: Profile multiple tables concurrently
  • Timeout Limits: Set reasonable timeouts to prevent hanging
  • Resource Limits: Use read replicas if available
  • Index Utilization: Ensure profiling queries use indexes

Data Privacy

  • Sensitive Columns: Exclude PII from profiling
  • Value Samples: Avoid storing actual sensitive values
  • Access Control: Restrict profiling results to authorized users
  • Audit Logging: Track who views profiling data

Use Cases

Sales Analysis

Scenario: Analyzing monthly sales trends

Without Profiling:

  • User asks about sales
  • AI doesn't know typical ranges
  • Returns query with potential outliers

With Profiling:

  • AI knows sales range from $100-$1M
  • Identifies $10M sale as outlier
  • Suggests investigating or excluding it
  • Notes 20% of sales are from repeat customers

Customer Segmentation

Scenario: Understanding customer base

Profiling Reveals:

  • 45% of customers have 1 purchase only
  • Average customer lifetime: 2.3 years
  • Top 10% of customers drive 60% of revenue
  • Peak signup months: January, September

AI Uses This To:

  • Suggest cohort analysis
  • Recommend retention strategies
  • Identify high-value segments
  • Predict churn patterns

Data Migration

Scenario: Moving to new database

Profiling Helps:

  • Validate row counts match
  • Verify value distributions unchanged
  • Identify data type mismatches
  • Confirm relationship integrity

Technical Details

Storage

Profiling results are stored in PostgreSQL tables:

table_profiling_results
- database_id, table_name
- row_count, size_bytes
- profiled_at, metadata

column_profiling_results
- table_id, column_name
- data_type, nullable
- unique_count, null_percentage
- statistics (JSON: min, max, avg, etc.)

API Access

Retrieve profiling data programmatically:

GET /api/profiling/databases/{db_id}/tables/{table_name}

Response:
{
"table_name": "customers",
"row_count": 1247893,
"columns": [
{
"name": "customer_id",
"type": "INTEGER",
"unique_count": 1247893,
"null_percentage": 0.0,
"statistics": {...}
},
...
]
}

FAQ

Q: Does profiling slow down my database? A: No, profiling uses sampling and runs during off-peak hours with strict timeouts.

Q: Can I disable profiling for specific tables? A: Yes, use exclude patterns in the profiling configuration.

Q: How often should I run profiling? A: Daily for active tables, weekly for historical/archive tables.

Q: Is profiling data included in backups? A: Yes, profiling results are part of the QRY database and included in backups.

Q: Can I see historical profiling data? A: Currently only the latest profiling run is stored; historical tracking is on the roadmap.

Next Steps


Automated data profiling ensures QRY has the context needed to provide intelligent, accurate insights about your data.

QRYA product of IXEN.