Published as a conference paper at ICLR 2026

Diffusion Models as Dataset Distillation Priors

DAP treats pretrained diffusion models as a source of diversity, generalization, and representativeness priors, enabling training-free guidance for higher-quality distilled datasets.

Duo Su1, Huyu Wu2, Huanran Chen1, Yiming Shi3, Yuzhu Wang4, Xi Ye1, Jun Zhu1
1Tsinghua University 2Institute of Computing Technology, CAS 3University of Electronic Science and Technology of China 4South China University of Technology

Figure 1 from the paper. DAP highlights how diffusion priors jointly improve diversity, representativeness, and downstream performance for dataset distillation.

Overview

Dataset distillation aims to compress a large dataset into a much smaller synthetic one that still supports strong downstream training. Recent generative methods use diffusion models as powerful foundations, but mostly rely on their sampling quality alone. DAP asks a simple question: can diffusion models provide better priors for dataset distillation than we currently use?

Our answer is yes. DAP interprets pretrained diffusion models as carrying three useful priors: diversity, generalization, and an often overlooked representativeness prior. The key contribution is a training-free guidance term, defined through feature-space similarity with a Mercer kernel, that nudges reverse diffusion toward distilled samples that better match the original data distribution.

Key Idea

Motivation From Section 3.1

Distilled Dataset s.t. Diversity + Generalization + Representativeness

The paper argues that an ideal distilled dataset should simultaneously preserve coverage of the original data manifold, avoid overfitting to a single evaluation architecture, and retain the most critical information from the raw dataset. DAP is designed around this trifecta rather than optimizing sample realism alone.

Formally, DAP starts from the original diffusion score and injects representativeness as an additional conditional term:

∇x log p(x|R) = ∇x log p(x) + ∇x log p(R|x)

The first term contributes diversity and generalization, while the second term brings in representativeness guidance during sampling.
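The decomposition above can be sketched as a single guided score evaluation. This is a minimal illustration, not the paper's implementation: `score_fn` stands in for the pretrained score network (∇x log p(x)) and `repr_grad_fn` for the representativeness guidance term (∇x log p(R|x)); both names, and the `guidance_scale` knob, are hypothetical.

```python
import numpy as np

def guided_score(x, t, score_fn, repr_grad_fn, guidance_scale=1.0):
    """One guided score evaluation during reverse diffusion.

    Mirrors the decomposition
        grad_x log p(x|R) = grad_x log p(x) + grad_x log p(R|x):
    the base score carries the diversity and generalization priors,
    and the added gradient carries representativeness guidance.
    """
    base = score_fn(x, t)          # grad_x log p(x): pretrained diffusion score
    guidance = repr_grad_fn(x, t)  # grad_x log p(R|x): representativeness term
    return base + guidance_scale * guidance
```

Because the guidance enters only as an additive term at sampling time, the pretrained generator itself never needs retraining, which is what makes the approach training-free.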

1. Diversity Prior

Pretrained diffusion models naturally cover multiple modes of the data distribution, helping distilled datasets avoid collapse.

2. Generalization Prior

Diffusion-based distillation is less tied to a single surrogate classifier, which improves transfer across architectures.

3. Representativeness Prior

DAP formalizes representativeness in feature space and injects it as guidance during reverse diffusion, without retraining the generator.

In short, DAP turns diffusion models from generic generators into task-aware priors for dataset distillation. This leads to distilled samples that are not only diverse and realistic, but also more aligned with the original dataset and more robust across evaluation architectures.
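The representativeness prior can be made concrete with a small feature-space sketch. The snippet below uses an RBF kernel, which is one valid Mercer kernel; the paper's actual kernel choice and feature extractor are not reproduced here, and `phi_x` (candidate features), `real_feats` (a bank of real-data features), and `bandwidth` are all illustrative assumptions.

```python
import numpy as np

def rbf_similarity(phi_x, real_feats, bandwidth=1.0):
    """Mean RBF-kernel similarity between a candidate feature vector
    and a bank of real-data features (a proxy for p(R|x))."""
    d2 = ((real_feats - phi_x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2)).mean()

def repr_guidance(phi_x, real_feats, bandwidth=1.0):
    """Gradient of log similarity w.r.t. phi_x: the direction that
    pulls a sample toward the real feature distribution."""
    diff = real_feats - phi_x                                        # (N, D)
    k = np.exp(-(diff ** 2).sum(axis=1) / (2.0 * bandwidth ** 2))    # (N,)
    grad_sim = (k[:, None] * diff).mean(axis=0) / bandwidth ** 2     # d sim / d phi
    return grad_sim / (k.mean() + 1e-12)                             # d log sim / d phi
```

For a candidate whose features sit off the real-data cluster, `repr_guidance` points back toward that cluster, which is exactly the nudge injected into reverse diffusion.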

Main Results

49.1% Top-1 on ImageNet-1K at IPC10

62.7% Top-1 on ImageNet-1K at IPC50

68.1% cross-architecture Top-1 on ResNet-101 at IPC50

On ImageNet-1K, DAP achieves state-of-the-art distilled data performance with 49.1% Top-1 accuracy at IPC10 and 62.7% at IPC50. These gains come without adding extra training to the diffusion model itself.

DAP also remains strong when the evaluation backbone changes. On ImageNet-1K cross-architecture transfer, it consistently outperforms prior methods on ResNet-101, MobileNet-V2, EfficientNet-B0, and Swin Transformer, supporting the claim that the distilled data is architecture-agnostic rather than overfit to a single classifier.

Table 3: Results on ImageNette and ImageWoof

Table 3 from the paper. Results are evaluated with the hard-label protocol on ImageNette and ImageWoof.

Visual Evidence

The figure below corresponds to Figure 5 in the paper and compares real versus synthetic feature distributions under IPC50.

The t-SNE plots show that DAP aligns synthetic samples with the training distribution while maintaining generalization to test data, supporting the paper's claims about diversity and representativeness.

Figure 5 from the paper: t-SNE visualizations of four panels, ImageNette-Training, ImageNette-Test, ImageWoof-Training, and ImageWoof-Test.

Resources

Venue: ICLR 2026

Project page: https://suduo94.github.io/Diffusion-As-Priors

Benchmarks highlighted: ImageNet-1K, ImageNette, ImageWoof, ImageIDC

Backbones discussed: Stable Diffusion and DiT

BibTeX

@inproceedings{su2026dap,
  title     = {Diffusion Models as Dataset Distillation Priors},
  author    = {Su, Duo and Wu, Huyu and Chen, Huanran and Shi, Yiming and Wang, Yuzhu and Ye, Xi and Zhu, Jun},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}