A DBT-Based Column Lineage Tracking Approach for Regulatory and Audit Compliance

Main Article Content

Shreekant Malviya, Abhishek Jaiswal, Vivek Koli

Abstract

Regulatory programs are also becoming more and more intuitive about demonstrated column-level lineage of PII/PCI/PHI and financial qualities, which many organizations now have only table-based DAGs. The paper presents a compile-time solution based on the dbt-first proposal, currently processing compiled SQL and dbt artifacts to produce normalized column-to-column edges and storing them in an open governance store, which is mapped to control evidence. The strategy was tested with a production-similar project (≈250 models, 1,800 columns, ~9,000 edges) and Snowflake and BigQuery profiles in 1,200- 1,800-day schedules of tasks. The accuracy was measured on a stratified, dual-view ground truth (n=400): precision 0.971 (95% CI 0.952–0.987), recall 0.934 (0.905–0.960), F1 0.952. There was coverage of 96.7 overall (customer 97.8%, payments 96.4%, and health 95.1%) fields tagged with critical coverage. Operation impact was limited: compile/run deltas of between 7.6% and lineage metadata of less than 5 GB/month, and platform cost deltas of less than 250/month at scale stated. The compliance results were significantly improved: median DSAR turnaround decreased by 12.1 to 4.3 days (−64%); SOX evidence-pack assembly decreased to 18 minutes, and quarterly defect leakage remained 1.6%. Recall gains were seen in ablations with materialization of transient models (+3.2 pp), macro-depth limits and explicit select lists (+2.9 pp), and propagation of the best-effort UDF (+1.7 pp). Dynamic SQL, non-dbt pipelines are also limited; the key areas of future development are sub-minute streaming lineage, semantic equivalence +37 pp recall, and interoperability using standards. 

Article Details

Section
Articles