A DBT-Based Column Lineage Tracking Approach for Regulatory and Audit Compliance
Main Article Content
Abstract
Regulatory programs are also becoming more and more intuitive about demonstrated column-level lineage of PII/PCI/PHI and financial qualities, which many organizations now have only table-based DAGs. The paper presents a compile-time solution based on the dbt-first proposal, currently processing compiled SQL and dbt artifacts to produce normalized column-to-column edges and storing them in an open governance store, which is mapped to control evidence. The strategy was tested with a production-similar project (≈250 models, 1,800 columns, ~9,000 edges) and Snowflake and BigQuery profiles in 1,200- 1,800-day schedules of tasks. The accuracy was measured on a stratified, dual-view ground truth (n=400): precision 0.971 (95% CI 0.952–0.987), recall 0.934 (0.905–0.960), F1 0.952. There was coverage of 96.7 overall (customer 97.8%, payments 96.4%, and health 95.1%) fields tagged with critical coverage. Operation impact was limited: compile/run deltas of between 7.6% and lineage metadata of less than 5 GB/month, and platform cost deltas of less than 250/month at scale stated. The compliance results were significantly improved: median DSAR turnaround decreased by 12.1 to 4.3 days (−64%); SOX evidence-pack assembly decreased to 18 minutes, and quarterly defect leakage remained 1.6%. Recall gains were seen in ablations with materialization of transient models (+3.2 pp), macro-depth limits and explicit select lists (+2.9 pp), and propagation of the best-effort UDF (+1.7 pp). Dynamic SQL, non-dbt pipelines are also limited; the key areas of future development are sub-minute streaming lineage, semantic equivalence +37 pp recall, and interoperability using standards.
Article Details

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.