TLDR: We bridge the gap between single-step training and multi-step inference in Masked Diffusion Models by introducing Co-GRPO, a framework that cooperatively optimizes both the generative model and the inference schedule using reinforcement learning, achieving superior visual quality and alignment without costly multi-step backpropagation.
Recently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective. This simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, so the inference schedules are never optimized during training.
In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks—ImageReward, HPS, GenEval, and DPG-Bench—demonstrate the effectiveness of our approach.
Co-GRPO reformulates the generation process as a joint Markov Decision Process where the agent acts on both visual tokens $V$ and inference schedule parameters $A$ (e.g., sampling temperature $\tau_s$, re-mask ratio $r$). To effectively explore this expanded space, we introduce a Factorized Policy that explicitly decouples the Model Policy $\pi_\theta$ from the Scheduling Policy $\pi_\phi$.
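To make the factorization concrete, the sketch below shows one possible realization in PyTorch: a per-step scheduling policy $\pi_\phi$ that samples $(\tau_s, r)$ from categorical distributions, and a single decoding step in which the MDM (playing the role of $\pi_\theta$) fills masked visual tokens at temperature $\tau_s$ and re-masks the least confident fraction $r$. The class and function names, the discrete candidate grids, and the confidence-based re-masking rule are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class SchedulingPolicy(nn.Module):
    """pi_phi (illustrative): a per-step categorical policy over schedule
    actions A, here the sampling temperature tau_s and the re-mask ratio r."""

    def __init__(self, num_steps, temps=(0.5, 0.8, 1.0), ratios=(0.0, 0.1, 0.3)):
        super().__init__()
        self.temps, self.ratios = torch.tensor(temps), torch.tensor(ratios)
        # Independent logits per decoding step (a deliberately simple phi).
        self.temp_logits = nn.Parameter(torch.zeros(num_steps, len(temps)))
        self.ratio_logits = nn.Parameter(torch.zeros(num_steps, len(ratios)))

    def sample(self, step):
        """Sample (tau_s, r) for one step and return their joint log-prob."""
        dt = Categorical(logits=self.temp_logits[step])
        dr = Categorical(logits=self.ratio_logits[step])
        it, ir = dt.sample(), dr.sample()
        return self.temps[it], self.ratios[ir], dt.log_prob(it) + dr.log_prob(ir)


def decode_step(mdm, tokens, mask, tau_s, r, mask_id):
    """One step of the joint MDP (batch size 1 for brevity): pi_theta (the MDM)
    fills the masked positions of the visual tokens V at temperature tau_s,
    then the least confident fraction r of those predictions is re-masked."""
    logits = mdm(tokens)                                  # [1, L, vocab]
    probs = torch.softmax(logits / tau_s, dim=-1)
    sampled = torch.multinomial(probs.flatten(0, 1), 1).view_as(tokens)
    conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
    tokens = torch.where(mask, sampled, tokens)           # fill masked slots only
    n_remask = int(r * mask.sum())                        # re-mask budget for this step
    if n_remask > 0:
        conf = conf.masked_fill(~mask, float("inf"))      # only newly decoded slots compete
        idx = conf.topk(n_remask, largest=False).indices
        tokens.scatter_(1, idx, mask_id)
    return tokens
```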
Based on this factorization, we propose an Alternating Co-Optimization strategy. Instead of performing unstable joint updates, we alternately optimize $\theta$ and $\phi$ to maximize the Group Relative Advantage. This mechanism allows the framework to autonomously discover decoding trajectories aligned with human preferences, effectively "self-driving" the inference process.
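As a complementary sketch, one alternating update might look like the following: a group of trajectories is rolled out for the same prompt, their rewards are normalized into a Group Relative Advantage, and that shared advantage weights a REINFORCE-style surrogate for whichever policy is being updated while the other stays frozen. The `rollout_fn` and `reward_model` arguments are hypothetical placeholders, and the surrogate omits the importance ratios, clipping, and KL regularization that a full GRPO objective would typically include.

```python
import torch


def group_relative_advantage(rewards, eps=1e-6):
    """Normalize rewards within a group of rollouts for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def co_grpo_step(rollout_fn, reward_model, opt_theta, opt_phi, prompt,
                 group_size=8, update="theta"):
    """One alternating update: freeze one policy, update the other with the
    shared group-relative advantage; gradients flow only through the recorded
    log-probs, not through the multi-step sampling chain.

    `rollout_fn(prompt)` is assumed to run full multi-step decoding (e.g. with
    decode_step above) and return the generated image together with the summed
    log-probs of the token actions (pi_theta) and schedule actions (pi_phi)."""
    rewards, logps_theta, logps_phi = [], [], []
    for _ in range(group_size):
        image, lp_theta, lp_phi = rollout_fn(prompt)
        rewards.append(reward_model(image, prompt))
        logps_theta.append(lp_theta)
        logps_phi.append(lp_phi)

    adv = group_relative_advantage(torch.stack(rewards).detach())
    if update == "theta":   # model update, schedule parameters frozen
        loss = -(adv * torch.stack(logps_theta)).mean()
        opt_theta.zero_grad(); loss.backward(); opt_theta.step()
    else:                   # schedule update, model parameters frozen
        loss = -(adv * torch.stack(logps_phi)).mean()
        opt_phi.zero_grad(); loss.backward(); opt_phi.step()
    return loss.item()
```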
Co-GRPO consistently outperforms state-of-the-art baselines across all metrics, delivering gains in both reward alignment and zero-shot generalization.
Representative high-quality images generated by Co-GRPO.
The BibTeX citation will be updated upon the paper's public release.
Please stay tuned!