SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions

Abstract

The remarkable capabilities of pretrained image diffusion models have been utilized not only for generating fixed-size images but also for creating panoramas. However, naive stitching of multiple images often results in visible seams. Recent techniques have attempted to address this issue by performing joint diffusions in multiple windows and averaging latent features in overlapping regions. Yet these approaches, which focus on seamless montage generation, often yield incoherent outputs by blending different scenes within a single image. To overcome this limitation, we propose SyncDiffusion, a plug-and-play module that synchronizes multiple diffusions through gradient descent from a perceptual similarity loss. Specifically, we compute the gradient of the perceptual loss using the predicted denoised images at each denoising step, providing meaningful guidance for achieving coherent montages. Our experimental results demonstrate that our method produces significantly more coherent outputs than previous methods (66.35% vs. 33.65% in our user study) while still maintaining fidelity (as assessed by GIQA) and compatibility with the input prompt (as measured by CLIP score).
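The window-averaging step that the abstract attributes to prior joint-diffusion techniques can be illustrated with a minimal sketch. The function name `average_overlapping_windows`, the `(C, H, W)` latent layout, and the horizontal-only window offsets are assumptions made for illustration; this is not the authors' code or the MultiDiffusion implementation.

```python
import numpy as np

def average_overlapping_windows(windows, offsets, total_width):
    """Merge per-window latents of shape (C, H, W) into one (C, H, total_width) latent.

    Hypothetical helper: each window is placed at its horizontal offset and
    columns covered by several windows receive the mean of their values.
    """
    channels, height, win_width = windows[0].shape
    value_sum = np.zeros((channels, height, total_width))
    cover_count = np.zeros((1, 1, total_width))
    for latent, start in zip(windows, offsets):
        value_sum[:, :, start:start + win_width] += latent
        cover_count[:, :, start:start + win_width] += 1
    # Assumes every column is covered by at least one window.
    return value_sum / cover_count

# Two 8-column windows overlapping on columns 4..7 of a 12-column panorama latent.
windows = [np.zeros((4, 8, 8)), np.ones((4, 8, 8))]
merged = average_overlapping_windows(windows, offsets=[0, 4], total_width=12)
```

In this toy case the overlapping columns hold the mean of the two windows (0.5), which removes the seam but does nothing to make the two windows depict the same scene; that gap is what the synchronization module targets.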

Main Idea

LPIPS scores computed across the noisy images at an intermediate step (t = 45 out of 50) of the reverse process (left), the predicted denoised images at the same timestep t (middle), and the final generated images at timestep t = 0 (right). The noisy images are nearly indistinguishable from one another and therefore yield similar LPIPS scores, whereas the predicted denoised images, which closely resemble the final outputs even at the beginning of the denoising process, exhibit LPIPS scores that align with those of the final generated images. This indicates that the predicted denoised images can provide meaningful guidance for producing coherent panoramas in the diffusion process.
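A rough sketch of how predicted denoised images can drive a synchronization step follows. The prediction uses the standard DDPM relation x0 ≈ (x_t − √(1 − ᾱ_t)·ε) / √ᾱ_t. Two simplifications are assumptions of this sketch, not the method described above: a squared L2 distance stands in for the LPIPS perceptual loss, and the noise prediction is treated as constant when taking the gradient. The names `predict_denoised`, `sync_step`, and `weight` are hypothetical.

```python
import numpy as np

def predict_denoised(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean image from a noisy sample and a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def sync_step(x_t, eps_pred, anchor_x0, alpha_bar_t, weight):
    """One gradient-descent step pulling a window toward an anchor window.

    Stand-in loss (assumption): ||x0_hat - anchor_x0||^2 instead of LPIPS.
    The gradient is taken w.r.t. x_t with eps_pred held fixed (assumption).
    """
    x0_hat = predict_denoised(x_t, eps_pred, alpha_bar_t)
    grad = 2.0 * (x0_hat - anchor_x0) / np.sqrt(alpha_bar_t)
    return x_t - weight * grad

rng = np.random.default_rng(0)
alpha_bar_t = 0.25
anchor_x0 = np.ones((3, 8, 8))    # predicted denoised image of the anchor window
window_x0 = np.zeros((3, 8, 8))   # clean content of another window
eps = rng.standard_normal((3, 8, 8))
x_t = np.sqrt(alpha_bar_t) * window_x0 + np.sqrt(1.0 - alpha_bar_t) * eps

loss_before = np.sum((predict_denoised(x_t, eps, alpha_bar_t) - anchor_x0) ** 2)
x_t_synced = sync_step(x_t, eps, anchor_x0, alpha_bar_t, weight=0.05)
loss_after = np.sum((predict_denoised(x_t_synced, eps, alpha_bar_t) - anchor_x0) ** 2)
```

With the stand-in loss, each step shrinks the distance between the window's predicted denoised image and the anchor's; a perceptual loss such as LPIPS would require a pretrained network and automatic differentiation in place of the closed-form gradient used here.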

Qualitative Comparisons

SyncDiffusion generates more globally coherent panorama images than the state-of-the-art MultiDiffusion [Bar-Tal et al. 2023].

Ours