Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference via Online Reward Adjustment

Abstract

Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable rewards. However, these methods face two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive and restricts optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models to achieve the desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior so that original images can be recovered from any timestep via interpolation. It leverages the fact that diffusion states are interpolations between noise and target images, and it effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.

Key Contributions

Direct-Align. We introduce a new sampling strategy for diffusion fine-tuning that can effectively restore highly noisy images, leading to an optimization process that is more stable and less computationally demanding, especially during the initial timesteps.

Faster Training. By rolling out only a single image and optimizing directly with analytical gradients—a key distinction from GRPO—our method achieves significant performance improvements for FLUX.1.dev in under 10 minutes of training. To further accelerate the process, our method supports replacing online rollouts entirely with a small dataset of real images; we find that fewer than 1500 images are sufficient to effectively train FLUX.1.dev.

Free of Reward Hacking. We improve the training strategy for methods that backpropagate directly through the reward signal (such as ReFL and DRaFT). Moreover, we directly regularize the model using negative rewards, without the need for KL divergence or a separate reward system. In our experiments, this approach achieves comparable performance across multiple different rewards, improving the perceptual quality of FLUX.1.dev without suffering from reward hacking issues, such as overfitting to color or oversaturation preferences.

Potential for Controllable Fine-tuning. Our work is the first in online RL to incorporate dynamically controllable prompt augmentation directly into the reward model, enabling online adjustment of reward preference towards styles within the scope of the reward model.
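The idea of a text-conditioned reward that is steered by positive and negative prompt augmentation can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the reward is modeled as cosine similarity between an image embedding and a text embedding, the relative score is taken as the positive-branch reward minus the negative-branch reward, and the toy vectors stand in for the outputs of real encoders. The names `text_conditioned_reward` and `relative_reward` are hypothetical and are not taken from the released code.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def text_conditioned_reward(image_emb, text_emb):
    """Assumed reward form: cosine similarity between image and text embeddings."""
    return float(normalize(image_emb) @ normalize(text_emb))

def relative_reward(image_emb, pos_text_emb, neg_text_emb):
    """Assumed relative score: reward under the positively augmented prompt
    minus reward under the negatively augmented prompt."""
    reward_branch = text_conditioned_reward(image_emb, pos_text_emb)
    penalty_branch = text_conditioned_reward(image_emb, neg_text_emb)
    return reward_branch - penalty_branch

# Toy embeddings standing in for encoder outputs (hypothetical values).
pos = np.array([1.0, 0.0, 0.0])      # e.g. prompt augmented with a desired style word
neg = np.array([0.0, 1.0, 0.0])      # e.g. prompt augmented with an undesired style word
image_a = np.array([0.9, 0.1, 0.3])  # image closer to the desired style
image_b = np.array([0.1, 0.9, 0.3])  # image closer to the undesired style

score_a = relative_reward(image_a, pos, neg)
score_b = relative_reward(image_b, pos, neg)
print(score_a > 0, score_b < 0)
```

In this toy setup, changing the augmentation words changes `pos` and `neg` and therefore the direction the score prefers, without retraining the reward model itself.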

Enhanced Realism (without AI look)

A close-up of a white flower with orange stamens on a branch.

A small blue-gray butterfly with black stripes rests on a white and yellow flower against a blurred green background.

A grey tabby cat with yellow eyes rests on a weathered wooden log under bright sunlight

An orange tabby cat with white markings sits on a rock, looking down

A person with dark skin and short black hair sits on a wooden chair, wearing a yellow floral dress against a grey wall with strong shadows.

A still frame from a black and white movie, featuring a man in classic attire, dramatic high contrast lighting, deep shadows, retro film grain, and a nostalgic cinematic mood.

A traditional Chinese building with red pillars and an ornate roof, with a pagoda visible in the background.

A view of countryside, natural lighting.

A bridge spans a wide river with a cityscape on the far bank, viewed from a grassy embankment.

Side of a street, where there is a fire hydrant and a mirror showing the street.

Two young ladies seated with several other people at a dinner table.

A single yellow flower with green stems stands out against a dark, blurred green background.

Enhanced Aesthetic (without AI look)

The Death of Ophelia by John Everett Millais, Pre-Raphaelite painting, Ophelia floating in a river surrounded by flowers, detailed natural elements.

Starry Night.

Girl with a pearl earring.jpg

A digital pen lineart sketch of a Japanese schoolgirl.

A first-person screenshot from Half-Life.

An investigator fights a tentacled monster in a finely detailed horror film still.

Hooded figure standing over a ruined city with red haze and a grin.

Renaissance angel depicted in Gerhard Richter's oil painting.

Framework

To enable precise reward assignment during the early stages of the diffusion process, we reconstruct a clean image from an intermediate noisy image using a ground-truth noise prior and a single-step denoising operation. Specifically, we first generate a clean image and inject Gaussian noise, thereby establishing a closed-form expression for recovering images at any diffusion timestep. The Direct-Align pipeline consists of four key stages: (0) generating images for training; (1) injecting noise into the images; (2) performing a one-step denoising or inversion operation; and (3) recovering the images. SRPO modifies the reward model by introducing two branches, penalty and reward, prior to scoring, which respectively evaluate the denoising and inversion processes.
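The closed-form recovery described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: it assumes a linear interpolation schedule x_t = (1 - t) * x0 + t * eps with t in [0, 1), and `predict_noise` is a hypothetical stand-in for the diffusion model's one-step prediction. The function names and the step size `dt` are likewise illustrative.

```python
import numpy as np

def inject_noise(x0, eps, t):
    """Stage (1): interpolate between the clean image x0 and known noise eps.
    Assumes a linear schedule with noise level t in [0, 1)."""
    return (1.0 - t) * x0 + t * eps

def recover_image(x_t, eps, t):
    """Stage (3): invert the interpolation in closed form using the same known noise."""
    return (x_t - t * eps) / (1.0 - t)

def one_step_then_recover(x_t, eps, t, dt, predict_noise):
    """Stages (2)+(3): remove a slice dt of the noise with the model's prediction
    and the remaining (t - dt) with the known noise prior.
    predict_noise is a hypothetical callable (x_t, t) -> noise estimate."""
    eps_hat = predict_noise(x_t, t)
    return (x_t - dt * eps_hat - (t - dt) * eps) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # stage (0): a toy "clean image"
eps = rng.standard_normal(x0.shape)          # predefined noise prior
t = 0.9                                      # a highly noisy timestep

x_t = inject_noise(x0, eps, t)
x0_rec = recover_image(x_t, eps, t)
max_err = float(np.max(np.abs(x0_rec - x0)))

# With an oracle predictor that returns the true noise, the one-step variant
# also reproduces x0; a real model would leave a residual that the reward can score.
x0_one_step = one_step_then_recover(x_t, eps, t, 0.1, lambda x, level: eps)
max_err_one_step = float(np.max(np.abs(x0_one_step - x0)))
print(max_err < 1e-9, max_err_one_step < 1e-9)
```

Because the injected noise is known, the recovered image stays sharp even at high noise levels, so only the single model call inside `one_step_then_recover` would need gradients during training.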

Cross-Reward Performance (without Reward Hacking)

Lighting and Style Control

Ablations

BibTeX

@misc{shen2025directlyaligningdiffusiontrajectory,
      title={Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference}, 
      author={Xiangwei Shen and Zhimin Li and Zhantao Yang and Shiyi Zhang and Yingfang Zhang and Donghao Li and Chunyu Wang and Qinglin Lu and Yansong Tang},
      year={2025},
      eprint={2509.06942},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.06942}, 
}