
DataHub Integration

Ever wish your AI assistant knew what "churn_score" actually means in your business? Or that "revenue_adjusted" uses your company's specific calculation method? With Qry's DataHub integration, it does—automatically.

Overview

DataHub is an enterprise metadata platform where data teams document their tables, fields, and business logic. Your business analysts and data stewards fill in definitions, tag important fields, and document data lineage—all in DataHub. Qry reads this rich context and uses it to provide smarter, more accurate answers.

What Qry Gets from DataHub

  • Field Descriptions: Business-friendly definitions written by your domain experts
  • Table Documentation: Purpose, ownership, and usage guidelines
  • Tags & Classifications: PII markers, domain assignments, data quality flags
  • Data Lineage: Where data comes from and where it flows to
  • Usage Analytics: Which tables and fields are most frequently queried
  • Ownership Information: Who to ask when questions arise

The result? Your AI understands not just the structure of your data, but the meaning behind it.

How It Works

Seamless Background Sync

When you ask a question in Qry:

  1. Query Analysis: AI identifies which tables and columns you're asking about
  2. Metadata Lookup: Qry fetches relevant definitions from DataHub
  3. Context Injection: Business definitions enrich the AI's understanding
  4. Smart Response: AI generates queries using your organization's terminology

This happens automatically—no extra steps required.
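
As a rough illustration of steps 2 and 3, the sketch below attaches known definitions to a question before it is handed to the model. The function name `enrich_question`, the prompt layout, and the in-memory `glossary` dictionary are assumptions made for this example; they are not Qry's actual internals.

```python
from typing import Callable, List, Optional


def enrich_question(
    question: str,
    columns: List[str],
    lookup_description: Callable[[str], Optional[str]],
) -> str:
    """Append any known business definitions to the question text."""
    notes = []
    for column in columns:
        description = lookup_description(column)  # step 2: metadata lookup
        if description:  # columns without documentation are skipped
            notes.append(f"- {column}: {description}")
    if not notes:
        return question
    # Step 3: context injection
    return question + "\n\nBusiness definitions:\n" + "\n".join(notes)


# Hypothetical stand-in for DataHub: a plain dictionary of descriptions.
glossary = {"risk_score": "Ranges from 1-100; 1-20 indicates high risk."}
prompt = enrich_question(
    "Show me high-risk customers", ["risk_score", "region"], glossary.get
)
print(prompt)
```

Columns with no definition (here, the hypothetical `region`) leave the question unchanged, which mirrors the partial-documentation behavior described later in this page.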

Example in Action

Without DataHub:

You: "Show me high-risk customers"

AI: "I see a 'risk_score' column. I'll assume higher values mean higher risk."
[Returns incorrect results because lower scores actually indicate higher risk in your system]

With DataHub:

You: "Show me high-risk customers"

AI: "According to DataHub, 'risk_score' ranges from 1-100 where 1-20 indicates high risk.
I'll filter for risk_score <= 20."
[Returns correct results using your business logic]

Key Features

Business Glossary Integration

Your data glossary lives in DataHub, curated by business users. Qry automatically applies these definitions:

DataHub Entry:

Field: lifetime_value
Description: Total revenue from customer, including recurring subscriptions,
one-time purchases, and upgrade fees. Excludes refunds and credits.
Formula: SUM(transactions.amount) WHERE transaction_type IN ('purchase', 'subscription', 'upgrade')
Business Owner: Finance Team
Last Updated: 2025-10-15

Qry Uses This:

You: "What's the lifetime value of our top 100 customers?"

AI: [Understands exact calculation method]
[Knows to exclude refunds automatically]
[Applies correct filters based on business definition]

Field-Level Context

Every column gets enriched metadata:

  • Data Type & Format: Beyond SQL types—business formats (e.g., "ISO 8601 timestamps")
  • Validation Rules: Expected ranges, enum values, null handling
  • Sensitive Data Flags: PII, confidential, or regulated data markers
  • Calculation Logic: For derived fields and metrics

Table Lineage Awareness

Qry understands your data flows:

Upstream:
raw_events → cleaned_events → customer_sessions → customer_metrics

Downstream:
customer_metrics → executive_dashboard
                 → ml_churn_model
                 → weekly_reports

Why This Matters:

  • Distinguishes source-of-truth tables from derived ones
  • Explains where data originates
  • Suggests related tables you might need
  • Warns about stale or deprecated tables
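
To make the lineage idea concrete, here is a minimal sketch that walks the example graph above to find everything upstream of a table. The `DOWNSTREAM` dictionary and the `upstream_of` helper are hypothetical; they only illustrate the kind of traversal lineage metadata makes possible.

```python
from collections import deque
from typing import Dict, List

# Edges from the example above: table -> tables it feeds.
DOWNSTREAM: Dict[str, List[str]] = {
    "raw_events": ["cleaned_events"],
    "cleaned_events": ["customer_sessions"],
    "customer_sessions": ["customer_metrics"],
    "customer_metrics": ["executive_dashboard", "ml_churn_model", "weekly_reports"],
}


def upstream_of(table: str, downstream: Dict[str, List[str]]) -> List[str]:
    """Return all ancestors of `table`, nearest first (breadth-first)."""
    # Invert the edges so each table maps to the tables that feed it.
    parents: Dict[str, List[str]] = {}
    for source, targets in downstream.items():
        for target in targets:
            parents.setdefault(target, []).append(source)

    found: List[str] = []
    seen = {table}
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


print(upstream_of("customer_metrics", DOWNSTREAM))
```

A table with an empty result (such as `raw_events` here) has no recorded upstream and is a candidate source-of-truth table.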

Usage Statistics

DataHub tracks how often tables and fields are queried across your organization. Qry leverages this:

  • Popular Fields: Suggests commonly-used columns
  • Rare Fields: Warns about uncommonly queried data (potential quality issues)
  • Query Patterns: Learns from how others use the data
  • Active vs. Stale: Identifies tables that haven't been used recently

Configuration

Prerequisites

Your organization needs:

  1. DataHub Instance: Running and accessible (Cloud or self-hosted)
  2. Metadata Ingestion: Tables and fields documented in DataHub
  3. API Access: Qry needs DataHub API credentials

Admin Setup

Administrators configure the integration once for the entire organization:

Admin Settings → Integrations → DataHub

DataHub GMS URL: https://datahub.yourcompany.com
API Token: [Your DataHub API token]
Enabled: Yes

Supported Platforms:
✓ BigQuery
✓ Starburst/Trino
✓ PostgreSQL
✓ Snowflake
✓ Databricks
✓ Redshift
✓ Salesforce

The integration is read-only. Qry never modifies your DataHub metadata.

Sync Behavior

  • Real-Time: Metadata fetched on-demand during queries
  • Caching: Recent lookups cached for 15 minutes
  • Graceful Fallback: If DataHub is unavailable, Qry falls back to database schema only
  • No Blocking: Queries never fail due to DataHub issues
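
The caching and fallback behavior can be pictured with a small sketch. The class name `MetadataCache`, the injected `fetch` callable, and the choice to return `None` on failure are assumptions for illustration; only the 15-minute window and the schema-only fallback come from the list above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 15 * 60  # 15-minute cache window described above


class MetadataCache:
    """On-demand lookup with a time-to-live cache and a non-failing fallback."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, urn: str) -> Optional[dict]:
        now = self._clock()
        cached = self._entries.get(urn)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no call to DataHub
        try:
            metadata = self._fetch(urn)
        except Exception:
            # Graceful fallback: the caller continues with database schema only.
            return None
        self._entries[urn] = (now, metadata)
        return metadata
```

Returning `None` instead of raising is what keeps a DataHub outage from failing the user's query in this sketch; the caller treats a missing result as "no extra context available".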

Use Cases

Finance & Accounting

Challenge: Complex metric definitions with regulatory requirements

DataHub Stores:

  • GAAP vs. non-GAAP revenue recognition rules
  • Expense categorization logic
  • Reconciliation formulas
  • Compliance documentation

Result: AI applies correct accounting standards automatically

You: "Show me non-GAAP revenue for Q3"

AI: [Uses DataHub's definition: excludes stock-based comp, acquisition costs]
[Applies correct adjustments per your CFO's specifications]
[Notes any changes from previous quarters]

Marketing Analytics

Challenge: Custom UTM parameters and attribution models

DataHub Stores:

  • Campaign naming conventions
  • Attribution window definitions (7-day vs. 30-day)
  • Conversion event taxonomy
  • Channel grouping rules

Result: Consistent attribution across all analyses

You: "What's our CAC by channel?"

AI: [Knows channel_grouping logic from DataHub]
[Applies 7-day attribution window per marketing team's standard]
[Calculates CAC using documented formula]

Product Analytics

Challenge: Event schemas evolve; old and new formats coexist

DataHub Documents:

  • Event schema versions and breaking changes
  • Deprecated events and their replacements
  • Property migrations (e.g., "user_id" → "user_uuid")
  • Platform-specific event nuances

Result: AI navigates schema evolution correctly

You: "Track user signups over time"

AI: [Sees in DataHub: 'user_created' replaced 'signup' in June 2025]
[Unions both events for historical continuity]
[Handles property name changes automatically]

Best Practices

In DataHub

Do:

  • Write field descriptions in plain business language
  • Include calculation formulas for derived metrics
  • Tag PII and sensitive data clearly
  • Keep ownership information current
  • Document any edge cases or gotchas

Don't:

  • Use technical jargon only engineers understand
  • Leave descriptions vague ("customer data")
  • Document incorrect or outdated definitions
  • Forget to update after schema changes

Example Good Documentation

Field: churn_risk_score

Description:
Predictive score (0-100) indicating likelihood of customer cancellation
within next 90 days. Higher scores = higher risk.

Calculation:
ML model (xgboost_v3) trained on historical churn patterns.
Input features: engagement metrics, support tickets, payment history.
Updated daily at 3am UTC.

Business Rules:
- Scores below 30: Low risk (no action)
- Scores 30-60: Medium risk (monitor)
- Scores above 60: High risk (intervention recommended)

Owner: Customer Success Analytics Team
Last Model Update: 2025-09-12
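
Documented business rules like these translate directly into logic. The sketch below encodes the three bands from the example; the function name `churn_risk_band` is hypothetical, and the treatment of the boundary values 30 and 60 as medium risk follows the wording "below 30" and "above 60".

```python
def churn_risk_band(score: float) -> str:
    """Map a churn_risk_score (0-100) to the documented risk band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score < 30:
        return "low"     # no action
    if score <= 60:
        return "medium"  # monitor
    return "high"        # intervention recommended


for value in (12, 30, 60, 75):
    print(value, churn_risk_band(value))
```

Spelling out the boundaries in the description, as this example does, removes the kind of ambiguity an AI assistant would otherwise have to guess at.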

In Qry

Leverage the Integration:

  • Ask questions using business terms, not just column names
  • Reference DataHub documentation in follow-ups
  • Use "explain" requests to see applied business logic
  • Report definition gaps to your data steward team

Privacy & Security

Data Protection

  • No Data Storage: The integration reads only metadata from DataHub, never actual data values
  • Access Control: Users only see metadata for tables they have permission to query
  • Audit Logging: All DataHub lookups logged for compliance
  • Encrypted Transit: All API calls use HTTPS/TLS

Sensitive Data Handling

DataHub tags such as PII, PHI, and Confidential are respected:

  • AI warns when querying sensitive fields
  • Suggests anonymization or aggregation
  • Prevents logging of sensitive column values
  • Enforces additional confirmation for PII queries
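
A minimal sketch of how tag checks like these might look, assuming column tags arrive as plain sets of strings. The names `SENSITIVE_TAGS`, `sensitive_columns`, and `needs_confirmation` are hypothetical and do not describe Qry's implementation.

```python
from typing import Dict, List, Set

# Tag names taken from the sentence above; exact matching is an assumption.
SENSITIVE_TAGS: Set[str] = {"PII", "PHI", "Confidential"}


def sensitive_columns(column_tags: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return only the columns that carry a sensitive tag, with those tags."""
    flagged: Dict[str, List[str]] = {}
    for column, tags in column_tags.items():
        hits = sorted(tags & SENSITIVE_TAGS)
        if hits:
            flagged[column] = hits
    return flagged


def needs_confirmation(column_tags: Dict[str, Set[str]]) -> bool:
    """PII queries require an extra confirmation step."""
    return any("PII" in tags for tags in column_tags.values())


query_columns = {
    "email": {"PII", "contact"},
    "diagnosis_code": {"PHI"},
    "signup_date": set(),
}
print(sensitive_columns(query_columns))
print(needs_confirmation(query_columns))
```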

Troubleshooting

Q: The AI isn't using my DataHub definitions

A: Check:

  1. DataHub integration is enabled in Admin Settings
  2. Your database platform is supported and correctly mapped
  3. Tables/schemas are documented in DataHub with correct URNs
  4. DataHub API token has read permissions

Q: Metadata seems outdated

A:

  • Qry caches metadata for 15 minutes
  • Update DataHub, then wait 15 minutes or restart your Qry session
  • Verify DataHub ingestion pipeline is running

Q: Some tables show metadata, others don't

A:

  • Check DataHub ingestion coverage—not all tables may be ingested
  • Verify table naming matches exactly (case-sensitive)
  • Some legacy tables may predate DataHub documentation efforts

Q: Can Qry write back to DataHub?

A: Not currently. Qry is read-only. To update definitions, use DataHub's UI or API directly.

Technical Details

Platform Mapping

Qry automatically maps database types to DataHub platforms:

Qry Database Type    DataHub Platform
BigQuery             bigquery
Starburst            trino
PostgreSQL           postgres
Snowflake            snowflake
Databricks           databricks
Redshift             redshift
Salesforce           salesforce

URN Construction

Tables are identified using DataHub URNs:

urn:li:dataset:(urn:li:dataPlatform:{platform},{catalog}.{schema}.{table},PROD)

Example:
urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.analytics.customers,PROD)
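
Putting the platform mapping and the URN template together, a sketch might look like the following. The dictionary and the `dataset_urn` helper are illustrative names; the mapping values and the URN layout are copied from this section.

```python
# Qry database type -> DataHub platform, as listed under Platform Mapping.
PLATFORM_BY_DB_TYPE = {
    "BigQuery": "bigquery",
    "Starburst": "trino",
    "PostgreSQL": "postgres",
    "Snowflake": "snowflake",
    "Databricks": "databricks",
    "Redshift": "redshift",
    "Salesforce": "salesforce",
}


def dataset_urn(
    db_type: str, catalog: str, schema: str, table: str, env: str = "PROD"
) -> str:
    """Build a dataset URN following the template shown above."""
    try:
        platform = PLATFORM_BY_DB_TYPE[db_type]
    except KeyError:
        raise ValueError(f"unsupported database type: {db_type}") from None
    return (
        f"urn:li:dataset:(urn:li:dataPlatform:{platform},"
        f"{catalog}.{schema}.{table},{env})"
    )


print(dataset_urn("BigQuery", "my-project", "analytics", "customers"))
```

Because the name portion is compared exactly, a casing difference between the database and DataHub produces a different URN and therefore no metadata match.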

API Calls

Typical metadata retrieval:

# Get field descriptions
GET /entities/urn:li:dataset:(...)/aspects/editableSchemaMetadata

# Get table documentation
GET /entities/urn:li:dataset:(...)/aspects/editableDatasetProperties

# Get lineage
GET /relationships?urn=urn:li:dataset:(...)&direction=BOTH

# Get usage stats
GET /usageStats?resource=urn:li:dataset:(...)
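
URNs contain colons, commas, and parentheses, so they generally need percent-encoding before being placed in a URL. The sketch below shows one way to do that with Python's standard library. The endpoint paths are copied from the list above and are not independently verified against the DataHub API; the helper names are hypothetical.

```python
from urllib.parse import quote, urlencode


def aspect_url(gms_url: str, urn: str, aspect: str) -> str:
    """Place a percent-encoded URN in the path of an aspect request."""
    encoded_urn = quote(urn, safe="")  # encode ':', ',', '(' and ')' as well
    return f"{gms_url.rstrip('/')}/entities/{encoded_urn}/aspects/{aspect}"


def usage_stats_url(gms_url: str, urn: str) -> str:
    """Pass the URN as an encoded query parameter."""
    query = urlencode({"resource": urn})
    return f"{gms_url.rstrip('/')}/usageStats?{query}"


URN = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.analytics.customers,PROD)"
print(aspect_url("https://datahub.yourcompany.com", URN, "editableSchemaMetadata"))
print(usage_stats_url("https://datahub.yourcompany.com", URN))
```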

Performance

  • Lookup Latency: ~50-200ms per table
  • Caching: 15-minute TTL reduces repeat lookups
  • Parallel Fetching: Multiple tables queried concurrently
  • Graceful Degradation: Falls back to schema-only if DataHub is slow or unavailable
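
Parallel fetching with per-table degradation can be sketched with a thread pool. The `fetch_all` helper, the `fetch_one` callable, and the worker count are assumptions for this example rather than a description of Qry's code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def fetch_all(
    urns: Iterable[str],
    fetch_one: Callable[[str], dict],
    max_workers: int = 8,
) -> Dict[str, Optional[dict]]:
    """Fetch metadata for several tables concurrently.

    A failed lookup yields None for that table instead of raising,
    so one slow or broken table does not affect the others.
    """
    urn_list = list(urns)

    def safe_fetch(urn: str) -> Optional[dict]:
        try:
            return fetch_one(urn)
        except Exception:
            return None  # degrade to schema-only for this table

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urn_list.
        results = list(pool.map(safe_fetch, urn_list))
    return dict(zip(urn_list, results))
```

With lookups in the ~50-200ms range, fetching tables concurrently keeps the total wait close to the slowest single lookup rather than the sum of all of them.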

Future Enhancements

On the roadmap:

  • Write-back: Update field descriptions from Qry based on usage patterns
  • ML Suggestions: AI-suggested metadata improvements
  • Lineage Visualization: Interactive data flow diagrams
  • Impact Analysis: "What breaks if I change this field?"
  • Quality Scoring: Automated metadata completeness reports

FAQ

Q: Do I need DataHub to use Qry?

A: No, DataHub is optional but highly recommended for enterprise deployments.

Q: How does this differ from Qry's Domain Context feature?

A: Domain Context uses uploaded PDFs for general knowledge. DataHub provides structured, field-level metadata tied directly to your database schema. The two are complementary and can be used together.

Q: Can Qry auto-populate DataHub from our existing data dictionary?

A: Not directly, but DataHub supports bulk ingestion from CSV or via API. Ask your data team about migration tools.

Q: What if our DataHub instance has restricted network access?

A: Ensure Qry's server can reach DataHub's API endpoint. Work with your network team to allow outbound connections.

Q: Does every field need documentation?

A: No. Qry works fine with partial documentation. Start with high-value tables and expand over time. Even 20% documentation coverage provides significant benefits.

Next Steps

  • Learn about Domain Context for complementary business knowledge
  • Explore Data Profiling for statistical insights
  • Discover Python Execution for advanced data analysis and visualizations
  • Try Qry Nexus — your DataHub assets appear in the unified Discover search with breadcrumbs and click-to-navigate
  • Check the Admin Guide for detailed configuration

DataHub integration transforms Qry from a smart SQL generator into a business-aware analytics partner that truly understands your organization's data.