Automated Data Profiling
QRY automatically analyzes your database tables to provide rich context and smarter query suggestions, eliminating the need to manually explore data structure and statistics.
Overview
Data profiling is the automated statistical analysis and examination of data to understand its structure, content, and quality. QRY's profiling system runs in the background, enriching conversations with valuable insights about your tables before you even ask.
What Gets Profiled
- Table Metrics: Row counts, size estimates, last update timestamps
- Column Analysis: Data types, null percentages, unique value counts
- Statistical Summaries: Min/max/avg for numeric columns, value distributions
- Data Quality: Completeness scores, potential data issues
- Relationships: Foreign key detection and cardinality analysis
How It Works
Automatic Background Profiling
Profiling runs on a configurable schedule without impacting database performance:
- Scheduled Jobs: Kubernetes CronJobs execute profiling at off-peak hours
- Sampling Strategy: Analyzes representative samples for large tables
- Incremental Updates: Only re-profiles tables that have changed
- Resource-Aware: Respects database load and query timeouts
Intelligent Context Injection
When you ask a question, profiling data automatically enhances the AI's understanding:
User: "Show me our top customers"
Behind the scenes, AI receives:
- customers table has 1.2M rows
- revenue column ranges from $0 to $500K
- top_customer flag exists (boolean, 5% true)
- last_purchase_date shows 80% active in last 90 days
This context enables smarter, more relevant responses.
Key Benefits
For Business Users
Faster Query Understanding
- AI suggests filters based on actual data distributions
- Automatically handles edge cases (nulls, outliers)
- Provides context about data freshness
Example:
You: "Show me sales by region"
Without Profiling: Generic query, may include test data
With Profiling: AI knows region column has 5 values, excludes NULL regions,
notes 80% of sales are in "North America"
For Data Analysts
Data Quality Insights
- Identify tables with high null percentages
- Spot potential data quality issues
- Understand value distributions before querying
Optimized Query Suggestions
- AI recommends appropriate aggregations
- Suggests useful filters based on cardinality
- Warns about potential performance issues
For Administrators
Automated Discovery
- New tables automatically profiled
- No manual configuration required for basic profiling
- Extensible for custom metrics
Performance Optimization
- Identify large tables needing optimization
- Monitor data growth trends
- Plan database resources proactively
Profiling Configuration
Admin Settings
Administrators can configure profiling per database:
Basic Configuration
Database: production_db
Profiling Enabled: Yes
Schedule: Daily at 2:00 AM UTC
Sample Size: 10,000 rows (for tables > 1M rows)
Timeout: 30 seconds per table
Advanced Options
-
Include/Exclude Patterns: Target specific schemas or tables
Include: public.*, analytics.*
Exclude: temp_*, staging.* -
Column-Level Control: Skip profiling for sensitive columns
Exclude Columns: ssn, credit_card, password_hash -
Performance Tuning: Adjust sampling and concurrency
Max Concurrent Tables: 5
Sample Percentage: 1% (for very large tables)
Manual Triggers
Run profiling on-demand for specific tables:
- Navigate to Admin → Data Profiling
- Select database and tables
- Click "Profile Now"
- Monitor progress in real-time
Profiling Results
Statistics Dashboard
View profiling results in the admin panel:
Table-Level Metrics
- Total row count and estimated size
- Number of columns and data types
- Last profiled timestamp
- Data freshness indicators
Column-Level Details
- Data type and nullable status
- Unique value count (cardinality)
- Null percentage
- Min/max/avg for numeric columns
- Most common values for categorical columns
Example Profiling Output
Table: customers
Rows: 1,247,893
Columns: 15
Last Updated: 2025-10-18 06:15:00
Profiled: 2025-10-18 02:00:00
Column: customer_id (INTEGER, PRIMARY KEY)
- Unique Values: 1,247,893 (100%)
- Nulls: 0%
- Range: 1 to 1247893
Column: signup_date (DATE)
- Unique Values: 1,456
- Nulls: 0.1%
- Range: 2020-01-01 to 2025-10-18
- Distribution: 65% in last 2 years
Column: customer_tier (VARCHAR)
- Unique Values: 4
- Nulls: 2%
- Values: Bronze (45%), Silver (35%), Gold (15%), Platinum (5%)
Column: lifetime_value (DECIMAL)
- Unique Values: 89,234
- Nulls: 0%
- Range: $0.00 to $487,231.15
- Average: $8,452.33
- Median: $3,200.00
Integration with AI
Smarter Query Generation
Profiling data helps the AI make better decisions:
Appropriate Aggregations
Low cardinality (4 values) → GROUP BY suggested
High cardinality (1M values) → TOP N suggested
Filter Recommendations
Column with 90% nulls → Add IS NOT NULL filter
Date column → Suggest recent date range
Performance Warnings
Table with 100M rows → Recommend limiting results
Complex join on high-cardinality columns → Suggest indexes
Enhanced Explanations
AI can explain results with data context:
User: "Why are there so few platinum customers?"
AI: "Based on the profiling data, only 5% of customers (about 62,000) reach
Platinum tier. This appears intentional as the tier requires lifetime value
above $50,000, and the average customer lifetime value is $8,452."
Data Quality Detection
Automatic Issue Identification
Profiling helps spot common data quality problems:
Missing Data
- Columns with unexpected high null percentages
- Patterns in missing data (e.g., nulls only for recent dates)
Inconsistent Values
- Spelling variations (e.g., "New York", "NY", "new york")
- Out-of-range values for constrained fields
- Unexpected data types or formats
Data Drift
- Significant changes in value distributions
- Sudden spikes or drops in row counts
- New categories appearing in categorical columns
Quality Alerts
Configure alerts for data quality issues:
Alert: High null percentage
Trigger: Column nulls > 20%
Action: Notify data team
Alert: Row count change
Trigger: Daily row count change > 50%
Action: Create ticket
Alert: New categorical values
Trigger: New values in tier, status, category columns
Action: Log for review
Best Practices
Scheduling
- Off-Peak Hours: Run profiling when database load is low
- Frequency: Daily for active tables, weekly for historical tables
- Incremental: Profile changed tables more frequently
Sampling
- Small Tables (under 100K rows): Profile all rows for accuracy
- Medium Tables (100K-10M): Sample 10-100K rows
- Large Tables (over 10M): Sample 0.1-1% or fixed 100K rows
Performance
- Parallel Execution: Profile multiple tables concurrently
- Timeout Limits: Set reasonable timeouts to prevent hanging
- Resource Limits: Use read replicas if available
- Index Utilization: Ensure profiling queries use indexes
Data Privacy
- Sensitive Columns: Exclude PII from profiling
- Value Samples: Avoid storing actual sensitive values
- Access Control: Restrict profiling results to authorized users
- Audit Logging: Track who views profiling data
Use Cases
Sales Analysis
Scenario: Analyzing monthly sales trends
Without Profiling:
- User asks about sales
- AI doesn't know typical ranges
- Returns query with potential outliers
With Profiling:
- AI knows sales range from $100-$1M
- Identifies $10M sale as outlier
- Suggests investigating or excluding it
- Notes 20% of sales are from repeat customers
Customer Segmentation
Scenario: Understanding customer base
Profiling Reveals:
- 45% of customers have 1 purchase only
- Average customer lifetime: 2.3 years
- Top 10% of customers drive 60% of revenue
- Peak signup months: January, September
AI Uses This To:
- Suggest cohort analysis
- Recommend retention strategies
- Identify high-value segments
- Predict churn patterns
Data Migration
Scenario: Moving to new database
Profiling Helps:
- Validate row counts match
- Verify value distributions unchanged
- Identify data type mismatches
- Confirm relationship integrity
Technical Details
Storage
Profiling results are stored in PostgreSQL tables:
table_profiling_results
- database_id, table_name
- row_count, size_bytes
- profiled_at, metadata
column_profiling_results
- table_id, column_name
- data_type, nullable
- unique_count, null_percentage
- statistics (JSON: min, max, avg, etc.)
API Access
Retrieve profiling data programmatically:
GET /api/profiling/databases/{db_id}/tables/{table_name}
Response:
{
"table_name": "customers",
"row_count": 1247893,
"columns": [
{
"name": "customer_id",
"type": "INTEGER",
"unique_count": 1247893,
"null_percentage": 0.0,
"statistics": {...}
},
...
]
}
FAQ
Q: Does profiling slow down my database? A: No, profiling uses sampling and runs during off-peak hours with strict timeouts.
Q: Can I disable profiling for specific tables? A: Yes, use exclude patterns in the profiling configuration.
Q: How often should I run profiling? A: Daily for active tables, weekly for historical/archive tables.
Q: Is profiling data included in backups? A: Yes, profiling results are part of the QRY database and included in backups.
Q: Can I see historical profiling data? A: Currently only the latest profiling run is stored; historical tracking is on the roadmap.
Next Steps
- Explore Workspaces for team collaboration with shared knowledge
- Explore Database Connectivity for connecting new data sources
- See QRY Nexus — profiling data feeds into Data Product quality scoring
- Read the Admin Guide for advanced configuration
Automated data profiling ensures QRY has the context needed to provide intelligent, accurate insights about your data.