Published as a conference paper at ICLR 2026
DAP treats pretrained diffusion models as a source of diversity, generalization, and representativeness priors, enabling training-free guidance for higher-quality distilled datasets.
Dataset distillation aims to compress a large dataset into a much smaller synthetic one that still supports strong downstream training. Recent generative methods use diffusion models as powerful foundations, but mostly rely on their sampling quality alone. DAP asks a simple question: can diffusion models provide better priors for dataset distillation than we currently use?
Our answer is yes. DAP interprets pretrained diffusion models as carrying three useful priors: diversity, generalization, and an often overlooked representativeness prior. The key contribution is a training-free guidance term, defined through feature-space similarity with a Mercer kernel, that nudges reverse diffusion toward distilled samples that better match the original data distribution.
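The representativeness prior rests on feature-space similarity under a Mercer kernel. As a minimal sketch (not the paper's implementation), an RBF kernel, which is Mercer, can score how well a candidate distilled sample's features match the feature cloud of the real dataset; the function names here are illustrative, not from the paper's code:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """RBF (Gaussian) kernel between two feature vectors; a valid Mercer kernel."""
    diff = a - b
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def representativeness(x_feat, real_feats, sigma=1.0):
    """Average kernel similarity between a candidate sample's features and the
    real dataset's features. Higher values mean the sample sits closer to the
    bulk of the original data distribution."""
    return float(np.mean([rbf_kernel(x_feat, r, sigma) for r in real_feats]))
```

A gradient of (the log of) such a score with respect to the sample is what can be injected as guidance during sampling.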
Goal: a distilled dataset with Diversity + Generalization + Representativeness
The paper argues that an ideal distilled dataset should simultaneously preserve coverage of the original data manifold, avoid overfitting to a single evaluation architecture, and retain the most critical information from the raw dataset. DAP is designed around this trifecta rather than optimizing sample realism alone.
Formally, DAP starts from the original diffusion score and injects representativeness as an additional conditional term:
∇x log p(x|R) = ∇x log p(x) + ∇x log p(R|x)
The first term contributes diversity and generalization, while the second term brings in representativeness guidance during sampling.
Pretrained diffusion models naturally cover multiple modes of the data distribution, helping distilled datasets avoid collapse.
Diffusion-based distillation is less tied to a single surrogate classifier, which improves transfer across architectures.
DAP formalizes representativeness in feature space and injects it as guidance during reverse diffusion, without retraining the generator.
In short, DAP turns diffusion models from generic generators into task-aware priors for dataset distillation. This leads to distilled samples that are not only diverse and realistic, but also more aligned with the original dataset and more robust across evaluation architectures.
- 49.1% Top-1 on ImageNet-1K at IPC10
- 62.7% Top-1 on ImageNet-1K at IPC50
- 68.1% cross-architecture Top-1 on ResNet-101 at IPC50
On ImageNet-1K, DAP achieves state-of-the-art distilled data performance with 49.1% Top-1 accuracy at IPC10 and 62.7% at IPC50. These gains come without adding extra training to the diffusion model itself.
DAP also remains strong when the evaluation backbone changes. On ImageNet-1K cross-architecture transfer, it consistently outperforms prior methods on ResNet-101, MobileNet-V2, EfficientNet-B0, and Swin Transformer, supporting the claim that the distilled data is architecture-agnostic rather than overfit to a single classifier.
Table 3 from the paper. Results are evaluated with the hard-label protocol on ImageNette and ImageWoof.
We only keep visualizations that appear in the final paper. The figure below corresponds to Figure 5 in the paper and compares real versus synthetic feature distributions under IPC50.
The t-SNE plots show that DAP aligns synthetic samples with the training distribution while maintaining generalization to test data, supporting the paper's claims about diversity and representativeness.
[Figure 5 panels: ImageNette-Training, ImageNette-Test, ImageWoof-Training, ImageWoof-Test]
Venue: ICLR 2026
Project page: https://suduo94.github.io/Diffusion-As-Priors
Benchmarks highlighted: ImageNet-1K, ImageNette, ImageWoof, ImageIDC
Backbones discussed: Stable Diffusion and DiT
@inproceedings{su2026dap,
title = {Diffusion Models as Dataset Distillation Priors},
author = {Su, Duo and Wu, Huyu and Chen, Huanran and Shi, Yiming and Wang, Yuzhu and Ye, Xi and Zhu, Jun},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026}
}