Do Accelerated Diffusion Language Models Reason Faithfully?
Recent work has explored diffusion language models (DLMs) as an alternative to autoregressive (AR) generation for reasoning tasks, yet little is known about the faithfulness of their intermediate reasoning trajectories. This study introduces a preliminary framework for measuring Diffusion Chain-of-Thought (DoT) faithfulness and provides an initial empirical analysis using the LLaDA-8B model and its accelerated variant, dLLM-Cache.
Using trajectory-level linear probes on the GSM8K benchmark, we examine how answer-relevant information emerges and evolves across diffusion steps, and how caching affects this process. Results show that correctness information appears early in the diffusion trajectory, accumulates over time, and remains largely preserved under acceleration with only modest degradation.
Although our analysis is limited to a single acceleration method and a probing-based evaluation, these findings provide early evidence that DLM reasoning dynamics can retain causal coherence under efficiency-oriented modifications. Future work will extend this framework with additional diagnostics and acceleration methods to build a more complete picture of faithfulness in diffusion-based reasoning.
We investigate the faithfulness of Diffusion Chain-of-Thought (DoT) and how training-free acceleration techniques affect this property. Unlike autoregressive models, where reasoning unfolds token by token, DLMs denoise bidirectionally across a sequence, exposing a rich state trajectory over diffusion time that can be interpreted as "latent thoughts."
We call a DoT faithful if its intermediate denoising states encode the causal information necessary to produce the final answer, i.e., if the trajectory lies on the model's minimal causal path from input to prediction.
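One minimal way to make this definition measurable (our notation, not a standard criterion from the literature) is to ask whether the final answer is linearly decodable from each intermediate state:

$$\mathrm{Faith}(t) \;=\; \Pr\big[\, g_\phi(x_t) = \hat{y} \,\big],$$

where $x_t$ is the latent state at diffusion step $t$, $\hat{y}$ is the model's final answer, and $g_\phi$ is a linear readout trained for step $t$. In this operational sense, a trajectory is faithful when $\mathrm{Faith}(t)$ rises well above the majority-class baseline early in the trajectory.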
For each diffusion timestep $t$, we train lightweight linear probes to predict the model's final answer from the intermediate latent state $x_t$. The accuracy of these probes indicates how early answer-relevant information emerges in the trajectory.
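Concretely, the probing loop can be sketched as follows. The shapes and helper names here are our own illustrative assumptions (we assume mean-pooled hidden states have already been extracted per step, and use scikit-learn's logistic regression as the linear probe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy_per_step(latents: np.ndarray, answers: np.ndarray) -> list[float]:
    """Train one linear probe per diffusion step; return held-out accuracy.

    latents: (n_samples, T, d) pooled hidden state x_t per sample and step
    answers: (n_samples,) label to decode, e.g. the model's final answer
             (or its correctness) bucketed into discrete classes
    """
    n_samples, T, d = latents.shape
    accuracies = []
    for t in range(T):
        X_train, X_test, y_train, y_test = train_test_split(
            latents[:, t, :], answers, test_size=0.2, random_state=0
        )
        probe = LogisticRegression(max_iter=1000)  # lightweight linear probe
        probe.fit(X_train, y_train)
        accuracies.append(probe.score(X_test, y_test))  # held-out accuracy at step t
    return accuracies  # an early rise => answer info emerges early in the trajectory
```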
We evaluate the LLaDA-8B baseline alongside dLLM-Cache, a training-free method that reuses intermediate representations across diffusion steps. We use the GSM8K math word problem benchmark with $T = 64$ reverse diffusion steps.
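The sketch below illustrates only the generic feature-reuse idea behind such training-free acceleration, not the actual dLLM-Cache algorithm (which uses a more sophisticated caching and refresh policy); `encode`, `decode_head`, and `denoise_step` are hypothetical placeholder interfaces, not a real API:

```python
import torch

@torch.no_grad()
def denoise_with_feature_cache(model, x_T, num_steps=64, refresh_every=4):
    """Illustrative feature-caching loop (NOT the actual dLLM-Cache algorithm).

    The expensive transformer pass is re-run only every `refresh_every` steps;
    in between, the cached features from the last full pass are reused.
    All `model.*` methods here are hypothetical placeholders.
    """
    x, feats = x_T, None
    for step in range(num_steps):
        t = num_steps - 1 - step                 # reverse diffusion time
        if feats is None or step % refresh_every == 0:
            feats = model.encode(x, t)           # full forward pass (expensive)
        logits = model.decode_head(feats)        # cheap readout reuses cached features
        x = model.denoise_step(x, logits, t)     # one reverse-diffusion update
    return x
```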
Comparison of baseline LLaDA-8B against dLLM-Cache on GSM8K. Accuracy is measured via exact match on the final numerical answer, and latency is averaged over 500 samples.
| Model | Accuracy (%) | Latency (s) | Speedup |
|---|---|---|---|
| LLaDA-8B (Baseline) | 39.70 | 6.31 | 1.00× |
| dLLM-Cache | 36.80 | 3.54 | 1.78× |
If you find our work useful in your research, please consider citing:
```bibtex
@techreport{aggarwal2025fastandfaithful,
  title       = {Fast \& Faithful: Diffusion Drift -- Do Accelerated Diffusion Language Models Reason Faithfully?},
  author      = {Anirud Aggarwal and Omkar Pathak and Nayana Gadde},
  year        = {2025},
  institution = {University of Maryland},
  note        = {Course Project, CMSC 848R: Language Model Interpretability (Instructor: Sarah Wiegreffe)}
}
```