Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model
University of Maryland, College Park
Abstract
Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures.
We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models.
Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-α, PixArt-Σ, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-α, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x.
Interactive Pareto Frontier
Explore the discovered Pareto frontiers interactively. The visualization shows PixArt-α at 256×256 resolution on PartiPrompts (unseen during optimization). These examples offer visual insight into the quality of images generated by each schedule, while the metrics in the tables below provide a quantitative comparison.
Best viewed on desktop for full interactivity.
Method Overview
ECAD formulates diffusion caching as a multi-objective optimization problem, discovering Pareto-optimal trade-offs between computational efficiency and generation quality. Our approach uses genetic algorithms to evolve caching schedules represented as binary tensors $S \in \{0,1\}^{N \times B \times C}$, where $N$ is the number of diffusion steps, $B$ is the number of transformer blocks, and $C$ is the number of cacheable components per block.
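As a concrete illustration of this representation, a schedule is just a binary array, and a simple compute proxy falls out of it directly. The dimensions below are hypothetical, and weighting all components equally is an assumption (ECAD scores exact per-component TMACs):

```python
import numpy as np

# Hypothetical dimensions: 20 diffusion steps, 28 transformer blocks,
# 3 cacheable components per block (self-attn, cross-attn, feedforward).
N, B, C = 20, 28, 3

rng = np.random.default_rng(0)
S = rng.integers(0, 2, size=(N, B, C))  # 1 = reuse cache, 0 = recompute

# The first step has no cache to reuse, so force full computation there.
S[0] = 0

# A toy cost proxy: the fraction of component calls actually recomputed.
recompute_fraction = 1.0 - S.mean()
print(f"recomputed fraction of component calls: {recompute_fraction:.2f}")
```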
Key Components
- Component-level Caching: We cache individual transformer components (self-attention, cross-attention, feedforward) rather than entire blocks.
- Genetic Algorithm: NSGA-II evolves a population of caching schedules using selection, crossover, and mutation operations.
- Multi-objective Optimization: Simultaneously optimize for low computational cost (TMACs) and high generation quality (Image Reward).
- Calibration-based: Uses only 100 text prompts for optimization, no image data required.
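To make the component-level caching idea concrete, here is a minimal sketch of how a schedule might gate one component at inference time. The class, its fields, and the toy feedforward are all hypothetical; ECAD's actual integration hooks into the transformer forward pass:

```python
import numpy as np

class CachedComponent:
    """Wraps one cacheable component (e.g. self-attention) of one block.

    If the schedule marks (step, block, component) as cached, the output
    from the last recomputed step is reused; otherwise the component runs
    and its output is stored. Illustrative only, not the ECAD codebase.
    """

    def __init__(self, fn, schedule, block_idx, comp_idx):
        self.fn = fn              # the real computation
        self.schedule = schedule  # binary tensor S of shape (N, B, C)
        self.block_idx = block_idx
        self.comp_idx = comp_idx
        self.cache = None

    def __call__(self, x, step):
        if self.schedule[step, self.block_idx, self.comp_idx] and self.cache is not None:
            return self.cache     # reuse: skip the computation entirely
        self.cache = self.fn(x)   # recompute and refresh the cache
        return self.cache

# Toy usage: a "feedforward" that doubles its input.
S = np.zeros((4, 1, 1), dtype=int)
S[1, 0, 0] = 1  # cache at step 1 only
ff = CachedComponent(lambda x: 2 * x, S, block_idx=0, comp_idx=0)
print(ff(np.array([1.0]), step=0))  # recomputed: [2.]
print(ff(np.array([5.0]), step=1))  # reused output from step 0: [2.]
```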
ECAD Algorithm
Input: Diffusion model $\mathcal{M}$, calibration prompts $\mathcal{P} = \{p_1, ..., p_m\}$, population size $n$, generations $G$, crossover probability $p_c$, mutation probability $p_m$
Output: Pareto frontier $\mathcal{F}$ of caching schedules
- $\mathcal{P}_0 \leftarrow \text{InitializePopulation}(n)$ // Random and heuristic schedules
- for $g = 1$ to $G$ do
- for each schedule $S \in \mathcal{P}_{g-1}$ do
- $\mathcal{I} \leftarrow \mathcal{M}_S(\mathcal{P})$ // Generate images with schedule $S$
- $S_{q} \leftarrow \mathcal{Q}(\mathcal{P}, \mathcal{I})$ // Compute Image Reward score
- $S_{c} \leftarrow \mathcal{C}(S)$ // Compute TMACs
- end for
- $\mathcal{P}_g \leftarrow \text{NSGA-II-Selection}(\mathcal{P}_{g-1})$ // Tournament selection
- $\mathcal{P}_g \leftarrow \text{Crossover}(\mathcal{P}_g, p_c)$ // 4-point crossover
- $\mathcal{P}_g \leftarrow \text{Mutation}(\mathcal{P}_g, p_m)$ // Bit-flip mutation
- end for
- $\mathcal{F} \leftarrow \text{ComputeParetoFrontier}(\bigcup_{g=0}^{G} \mathcal{P}_g)$
- return $\mathcal{F}$
Note: Each schedule $S \in \{0,1\}^{N \times B \times C}$ is a binary tensor where $N$ = diffusion steps, $B$ = transformer blocks, $C$ = cacheable components per block.
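The loop above can be sketched in a few dozen lines. This toy keeps only the essentials: two minimized objectives (a recompute-fraction proxy for TMACs and a made-up quality penalty standing in for negative Image Reward, since scoring real generations is expensive) plus crude non-dominated survival with bit-flip mutation. The actual ECAD uses NSGA-II with tournament selection and 4-point crossover; all names and dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, C = 4, 2, 2        # toy dimensions; real schedules are far larger
POP, GENS = 16, 5

def evaluate(S):
    """Return two minimized objectives for a schedule (both hypothetical).

    cost: fraction of component calls recomputed (a stand-in for TMACs).
    quality_penalty: a made-up quality-loss proxy in which caching early
    diffusion steps is penalized more heavily than caching late ones.
    """
    cost = 1.0 - S.mean()
    weights = np.linspace(2.0, 0.5, N)[:, None, None]
    quality_penalty = (S * weights).mean()
    return cost, quality_penalty

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all <=, one <)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

pop = [rng.integers(0, 2, size=(N, B, C)) for _ in range(POP)]
for gen in range(GENS):
    scored = [(evaluate(S), S) for S in pop]
    # Survive only non-dominated schedules (a crude stand-in for NSGA-II).
    front = [s for s in scored if not any(dominates(o[0], s[0]) for o in scored)]
    parents = [S for _, S in front]
    children = []
    while len(children) < POP - len(parents):
        child = parents[rng.integers(len(parents))].copy()
        flip = rng.random(child.shape) < 0.1  # bit-flip mutation
        child[flip] ^= 1
        children.append(child)
    pop = parents + children

pareto = [(evaluate(S), S) for S in pop]
pareto = [p for p in pareto if not any(dominates(q[0], p[0]) for q in pareto)]
print(f"final Pareto front size: {len(pareto)}")
```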
Results
Quantitative Results at 256×256 Resolution
We evaluated ECAD on three popular diffusion models with 20-step generation. Our method consistently outperforms prior approaches across multiple metrics while providing flexible speed-quality trade-offs. Despite being optimized only on Image Reward using 100 calibration prompts, ECAD achieves superior results on unseen benchmarks.
PixArt-α Results
| Method | Setting | TMACs↓ | Speedup↑ | Image Reward↑ | MJHQ FID↓ | MJHQ CLIP↑ |
|---|---|---|---|---|---|---|
| None | - | 5.71 | 1.00× | 0.97 | 9.75 | 32.77 |
| TGATE | m=15, k=1 | 4.86 | 1.14× | 0.87 | 10.38 | 32.33 |
| FORA | N=2 | 2.87 | 1.65× | 0.91 | 10.33 | 32.74 |
| ToCa | N=3, R=90% | 2.13 | 2.35× | 0.68 | 11.80 | 32.35 |
| DuCa | N=3, R=60% | 3.20 | 2.29× | 0.79 | 11.69 | 32.48 |
| DuCa | N=3, R=90% | 2.30 | 2.59× | 0.74 | 12.53 | 32.39 |
| ECAD | fast | 2.13 | 1.97× | 0.99 | 8.02 | 32.78 |
| ECAD | faster | 1.46 | 2.40× | 0.88 | 9.92 | 32.34 |
| ECAD | fastest | 1.18 | 2.58× | 0.77 | 8.67 | 32.24 |
FLUX-1.dev Results
| Method | Setting | TMACs↓ | Speedup↑ | Image Reward↑ | MJHQ FID↓ | MJHQ CLIP↑ |
|---|---|---|---|---|---|---|
| None | - | 198.69 | 1.00× | 1.04 | 17.77 | 31.06 |
| FORA | N=3 | 69.80 | 2.44× | 0.93 | 19.38 | 31.10 |
| ToCa | N=4, R=90% | 42.96* | 1.66×* | 0.93 | 21.59 | 30.88 |
| DiCache | - | 62.23 | 2.26× | 0.97 | 20.70 | 31.18 |
| TaylorSeer | N=5, O=2 | 59.88* | 2.55×* | 0.54 | 24.36 | 30.64 |
| TaylorSeer | N=6, O=1 | 49.97* | 3.03×* | 0.02 | 37.98 | 29.38 |
| ECAD | fast | 63.02 | 2.58× | 1.04 | 16.14 | 31.69 |
| ECAD | fastest | 43.60 | 3.37× | 0.89 | 21.43 | 31.67 |
Resolution Transfer Results (FLUX-1.dev 1024×1024)
One of ECAD's key strengths is its ability to generalize across resolutions. We demonstrate this on FLUX-1.dev by applying schedules optimized at 256×256 resolution directly to 1024×1024 image generation, without any additional optimization. Despite the 16× increase in pixel count, our schedules maintain competitive performance compared to methods specifically optimized for high resolution.
| Method | Setting | TMACs↓ | Speedup↑ | Image Reward↑ | COCO FID↓ | COCO CLIP↑ |
|---|---|---|---|---|---|---|
| None | - | 1190.25 | 1.00× | 1.14 | 25.45 | 31.08 |
| None | 40% steps | 476.10 | 2.41× | 0.83 | 25.20 | 30.73 |
| FORA | N=3 | 416.88 | 2.40× | 0.69 | 29.45 | 30.52 |
| ToCa | N=4, R=90% | 300.41* | 2.47×* | 1.09 | 26.88 | 31.32 |
| TaylorSeer | N=5, O=2 | 357.39* | 2.54×* | 0.94 | 42.81 | 31.74 |
| ECAD | slow256→1024 | 644.05 | 1.73× | 1.05 | 22.15 | 31.00 |
| ECAD | fast256→1024 | 376.62 | 2.63× | 1.05 | 26.69 | 30.91 |
This demonstrates that ECAD's discovered caching patterns capture fundamental properties of the diffusion process that remain effective across different resolutions, making it practical for deployment in varied settings without requiring resolution-specific optimization.
Qualitative Results
Learned Caching Schedules
ECAD discovers diverse caching patterns that vary across timesteps, blocks, and components. Red indicates cached components; gray indicates recomputed ones.
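The red/gray convention can be reproduced with a short sketch that renders a schedule as an image array, with timesteps as rows and (block, component) pairs as columns. Dimensions and colors are illustrative:

```python
import numpy as np

# Hypothetical schedule: 20 steps, 28 blocks, 3 components per block.
N, B, C = 20, 28, 3
rng = np.random.default_rng(1)
S = rng.integers(0, 2, size=(N, B, C))

# Flatten blocks x components onto one axis: rows = timesteps,
# columns = (block, component) pairs.
grid = S.reshape(N, B * C)

RED, GRAY = np.array([214, 39, 40]), np.array([180, 180, 180])
img = np.where(grid[..., None] == 1, RED, GRAY).astype(np.uint8)
print(img.shape)  # (20, 84, 3); save or imshow to inspect
```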
Citation
If you find our work useful, please consider citing:
@misc{aggarwal2025evolutionarycachingaccelerateofftheshelf,
title={Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model},
author={Anirud Aggarwal and Abhinav Shrivastava and Matthew Gwilliam},
year={2025},
eprint={2506.15682},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.15682},
}