CDLM: Corrective Diffusion Language Models - Remask Trajectory Visualization

📋 Abstract

Diffusion language models are structurally well-suited for iterative error correction, as their non-causal denoising dynamics allow arbitrary positions in a sequence to be revised. However, standard masked diffusion language model (MDLM) training fails to induce such behavior, since incorrect but visible tokens receive no supervision, leaving token-level confidence uninformative about reliability. As a result, confidence-guided refinement is ineffective for targeted correction.

To address this mismatch, we study corrective behavior in diffusion language models and propose a correction-oriented post-training principle that explicitly supervises visible incorrect tokens, enabling MDLMs to acquire error-aware confidence and targeted refinement capabilities. To systematically evaluate corrective behavior, we introduce the Code Revision Benchmark (CRB), a controllable and executable benchmark designed to evaluate corrective behavior in diffusion language models.

🎯 Case Study: Remask Trajectory Visualization

🔍 What is this case study?
To provide an intuitive understanding of why standard MDLMs struggle with self-correction, we present a detailed case study using LLaDA-8B-Base on the HumanEval benchmark. We visualize the remasking trajectory during iterative refinement on corrupted code samples, showing how the model's confidence values evolve across denoising steps. The key finding is that standard MDLM training produces unreliable token-level confidence. The model fails to assign low confidence to error tokens, making confidence-guided remasking ineffective for targeted correction. As shown in the examples below, even when errors exist, the model's confidence distribution does not accurately reflect error locations, leading to failed or misdirected refinement attempts.

🔬 Why is Confidence Inaccurate in Standard MDLMs?

                    Key Insight: Since incorrect but visible tokens never receive training gradients, the model has no mechanism to learn that it should express uncertainty about potentially wrong tokens. The confidence values for visible tokens are essentially untrained and meaningless for error detection.
                

As illustrated in the figure, during MDLM training: the cross-marked boxes ([MASK]) represent masked tokens where the reconstruction loss is applied and gradients flow. The beige tokens are visible inputs that the model conditions on. Critically, the brown output positions (corresponding to unmasked/visible tokens) receive no supervision during training.

This training paradigm creates a fundamental mismatch: when we later attempt to use the model's confidence to identify errors in visible tokens, the confidence values are unreliable because the model was never trained to produce meaningful confidence estimates for non-masked positions.

Figure: MDLM training. Cross-marked boxes denote masked tokens, while beige tokens are visible inputs. Green outputs indicate masked positions where the reconstruction loss is applied, whereas brown outputs correspond to unmasked tokens that receive no supervision during training.

📊 Code Revision Benchmark (CRB)

Corrective behavior refers to the ability of a model to identify and fix errors in its own outputs through iterative refinement. In the context of diffusion language models, this means using token-level confidence to localize potentially incorrect tokens, remask them, and regenerate correct replacements through subsequent denoising steps.

To systematically evaluate corrective behavior in diffusion language models, we introduce the Code Revision Benchmark (CRB), a controllable and executable benchmark designed specifically for studying self-correction capabilities.

                    CRB Features:
                    Controlled token-level corruption (operator, identifier, literal substitutions)
Execution-based validation of correctness
Systematic categorization of error types
Support for HumanEval, HumanEval+, MBPP, and MBPP+

                

This controlled corruption approach allows us to precisely measure whether diffusion language models can: (1) detect which tokens are corrupted based on confidence, and (2) successfully correct the errors through iterative refinement.

Figure: CRB corruption pipeline. A canonical program is tokenized, corrupted via type-preserving token replacement, validated by execution, categorized, and generated as a benchmark instance.

🎯 Corrective Diffusion Language Models (CDLMs)

We define Corrective Diffusion Language Models (CDLMs) as diffusion language models that have been post-trained with an absorbing-uniform mixture objective. This post-training enables the model to develop error-aware confidence, allowing it to assign lower confidence to potentially incorrect tokens and thereby support effective confidence-guided correction.

To address the fundamental limitation of standard MDLM training, we propose a correction-oriented post-training principle based on an absorbing-uniform mixture objective. This approach explicitly supervises visible corrupted tokens alongside masked reconstruction, enabling the model to learn error-aware confidence.

                🔄 Two-Stage Mixture Corruption:
                Stage 1 (Absorbing-mask corruption): A fraction of tokens are replaced with [MASK], providing standard reconstruction supervision.
Stage 2 (Uniform replacement corruption): Among the remaining visible tokens, some are replaced with randomly sampled incorrect tokens. These corrupted-but-visible positions receive explicit supervision.

            

The key insight is that uniform replacement introduces explicit noise at visible positions, requiring the model to detect and correct corrupted tokens. The training objective becomes:

                ℒ = (1/|𝓜|) Σi∈𝓜 ℓi + λnoise · (1/|𝓝|) Σi∈𝓝 ℓi
            

Here 𝓜 denotes masked positions (standard reconstruction) and 𝓝 denotes positions corrupted via uniform replacement. The first term preserves denoising capability, while the second term trains the model to recognize and downweight incorrect visible content.

📈 Experimental Results: CDLM Outperforms Standard MDLMs

To validate the effectiveness of our error-aware post-training approach, we compare Corrective Diffusion Language Models (CDLMs) against standard MDLMs and base models on the Code Revision Benchmark (CRB).

Figure: Performance comparison on CRB. Left: Radar chart showing correction success rates across different corruption levels (n=1 to n=5). Right: Pass@1 improvement over refinement steps. CDLM consistently outperforms MDLM and Base, demonstrating effective iterative correction.

The results demonstrate that CDLMs substantially outperform standard MDLMs across all corruption levels and refinement steps:

Experiments show that CDLMs substantially outperform standard MDLMs in both error localization (larger confidence gap between clean and corrupted tokens) and iterative correction (higher Pass@1 on CRB).

These results validate our hypothesis: by explicitly supervising visible corrupted tokens during training, CDLMs develop meaningful token-level confidence that accurately reflects error locations, enabling effective confidence-guided refinement.

🚀 Beyond Correction: Improved Pure Generation

An important benefit of enhanced corrective behavior is that it also improves performance in pure code generation tasks. This happens because diffusion language models naturally make mistakes during the iterative denoising process, and a model with better error-awareness can more effectively correct its own errors during decoding.

Training + Decoding	HumanEval		HumanEval+		MBPP		MBPP+		Avg
Training + Decoding	Pass@1	Pass@10	Pass@1	Pass@10	Pass@1	Pass@10	Pass@1	Pass@10	Pass@1	Pass@10
MDLM (Vanilla)	0.190	0.384	0.170	0.341	0.106	0.312	0.177	0.449	0.161	0.372
CDLM (Vanilla)	0.216	0.415	0.205	0.390	0.131	0.352	0.207	0.472	0.190	0.407
Base (Vanilla)	0.188	0.366	0.167	0.317	0.157	0.350	0.231	0.502	0.186	0.384
MDLM (ReMDM)	0.198	0.384	0.172	0.341	0.108	0.328	0.170	0.444	0.162	0.374
CDLM (ReMDM)	0.212	0.427	0.200	0.378	0.141	0.350	0.211	0.476	0.191	0.408
Base (ReMDM)	0.182	0.329	0.164	0.293	0.160	0.366	0.232	0.485	0.185	0.368

Table: Pass@1 and Pass@10 on pure completion tasks across coding benchmarks. All evaluations use absorbing-mask reconstruction with a shared remask-based decoding framework. Vanilla decoding updates confidence at every refinement step, while ReMDM fixes each token's confidence to the value it receives when first unmasked. CDLM consistently outperforms MDLM under both decoding strategies.

This finding highlights that corrective post-training benefits go beyond targeted error correction. By learning to recognize and downweight potentially incorrect tokens, CDLMs become fundamentally better at the iterative refinement process that underlies diffusion-based generation.

📝 Citation

If you find this work useful, please cite:

@misc{zhang2025correctivediffusionlanguagemodels,
      title={Corrective Diffusion Language Models}, 
      author={Shuibai Zhang and Fred Zhangzhi Peng and Yiheng Zhang and Jin Pan and Grigorios G. Chrysos},
      year={2025},
      eprint={2512.15596},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.15596}, 
}

🔄 CDLM: Corrective Diffusion Language Models