Dataset Inference for Data Provenance and Privacy Auditing in Tabular Foundation Models
Published in ICML 2026 Workshop on Foundation Models for Structured Data, 2026
Recommended citation: Dariush Wahdany, Jesse C. Cresswell, Naiqing Guan, Atiyeh Ashari Ghomi, Franziska Boenisch, Adam Dziedzic. Dataset Inference for Data Provenance and Privacy Auditing in Tabular Foundation Models. ICML 2026 Workshop on Foundation Models for Structured Data.
Tabular foundation models (TFMs) are increasingly deployed through black-box APIs and trained on real-world tabular datasets. Although private, proprietary, or otherwise unauthorized datasets may be incorporated into pre-training corpora, there are currently no dedicated methods for determining whether a given tabular dataset was used to train a TFM. To address this gap, we introduce the first dataset inference method for TFMs, which aims to infer whether a suspect dataset was part of a model’s pre-training data. We systematically analyze a broad collection of candidate signals that can be observed from a black-box TFM via input manipulation, and find that we can reliably infer dataset membership for several state-of-the-art TFMs trained on real tabular data, achieving up to 0.997 ROC AUC. We then study factors that influence dataset identification, including pre-training data composition, model capacity, and the use of real vs. synthetic data. Our results show that our dataset inference method is a practical auditing tool for detecting privacy leakage and the use of proprietary datasets in TFMs.
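For readers unfamiliar with the reported metric, the following is a minimal sketch of how an ROC AUC value can be computed for a dataset inference task. It assumes that some procedure has already assigned each suspect dataset a single membership score, with higher meaning "more likely used in pre-training". The function name `roc_auc` and the scores below are hypothetical and chosen only for illustration; they are not the paper's signals or results.

```python
def roc_auc(member_scores, non_member_scores):
    """Pairwise (Mann-Whitney) form of ROC AUC.

    Returns the fraction of (member, non-member) pairs in which the
    member dataset receives the higher score; ties count as half.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical per-dataset scores, made up for illustration only.
member_scores = [0.9, 0.8, 0.7, 0.4]      # datasets known to be in the training corpus
non_member_scores = [0.6, 0.3, 0.2, 0.1]  # datasets known to be held out

auc = roc_auc(member_scores, non_member_scores)
print(auc)  # 15 of 16 pairs are ranked correctly -> 0.9375
```

Under this reading, a value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of member and non-member datasets.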
[Paper] [PDF]