SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions

NeurIPS 2023

KAIST

SyncDiffusion generates globally coherent panoramas by synchronizing multiple diffusions.

Abstract

The remarkable capabilities of pretrained image diffusion models have been utilized not only for generating fixed-size images but also for creating panoramas. However, naive stitching of multiple images often results in visible seams. Recent techniques have attempted to address this issue by performing joint diffusions in multiple windows and averaging latent features in overlapping regions. However, these approaches, which focus on seamless montage generation, often yield incoherent outputs by blending different scenes within a single image. To overcome this limitation, we propose SyncDiffusion, a plug-and-play module that synchronizes multiple diffusions through gradient descent from a perceptual similarity loss. Specifically, we compute the gradient of the perceptual loss using the predicted denoised images at each denoising step, providing meaningful guidance for achieving coherent montages. Our experimental results demonstrate that our method produces significantly more coherent outputs compared to previous methods (66.35% vs. 33.65% in our user study) while still maintaining fidelity (as assessed by GIQA) and compatibility with the input prompt (as measured by CLIP score).


Main Idea

Framework

LPIPS scores computed across the noisy images at an intermediate step (t = 45 out of 50) of the reverse process (left), the predicted denoised images at the same timestep t (middle), and the final generated images at timestep t = 0 (right). The indistinguishable noisy images yield similar LPIPS scores across windows, whereas the predicted denoised images, which closely resemble the final outputs even early in the denoising process, exhibit LPIPS scores that align with those of the final generated images. This indicates that the predicted denoised images can provide meaningful guidance for producing coherent panoramas during the diffusion process.
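The "predicted denoised image" above is the standard one-step estimate of the clean image from a noisy latent, obtained by inverting the DDPM forward identity. A minimal numpy sketch, assuming standard DDPM notation (the function name and toy shapes are illustrative, not from the paper's code):

```python
import numpy as np

def predict_denoised(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean image x0_hat from a noisy latent x_t.

    Inverts the DDPM forward identity
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    using the network's noise prediction eps_pred. In SyncDiffusion,
    this x0_hat is what the perceptual (LPIPS) loss is computed on,
    since x0_hat resembles the final output even at early timesteps.
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Toy check: if eps_pred equals the true noise, x0 is recovered exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
a_bar = 0.3
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
x0_hat = predict_denoised(x_t, eps, a_bar)
assert np.allclose(x0_hat, x0)
```

In the actual method, the noise prediction comes from the pretrained diffusion model, and the gradient of the LPIPS loss between each window's x0_hat and an anchor window's x0_hat is backpropagated to the noisy latents at every denoising step.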


HuggingFace Demo




Generated Images

SyncDiffusion can generate images of arbitrary resolution by leveraging pretrained text-to-image diffusion models (e.g., Stable Diffusion) without any additional training.
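Arbitrary-resolution generation works by denoising overlapping fixed-size windows jointly and fusing them on a shared canvas, as in the prior joint-diffusion approaches the abstract mentions (overlapping latents are averaged so neighboring windows agree). A minimal sketch of that fusion step, with illustrative names and shapes:

```python
import numpy as np

def average_windows(panorama_shape, windows, positions, window_w):
    """Fuse per-window latents into one panorama latent.

    Each window is denoised independently by the pretrained model;
    overlapping regions are then averaged so adjacent windows stay
    consistent. SyncDiffusion adds a perceptual-loss gradient step on
    top of this fusion to make the windows globally coherent as well.
    """
    canvas = np.zeros(panorama_shape)
    counts = np.zeros(panorama_shape)
    for w, x in zip(windows, positions):
        canvas[:, x:x + window_w] += w
        counts[:, x:x + window_w] += 1.0
    return canvas / np.maximum(counts, 1.0)

# Two 8-wide windows on a 12-wide canvas, overlapping in columns 4..7.
fused = average_windows((4, 12),
                        [np.ones((4, 8)), 2.0 * np.ones((4, 8))],
                        [0, 4], window_w=8)
```

Here the overlap region ends up as the mean of the two windows (1.5), while non-overlapping columns keep their own window's value.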

New York (Ours)

"Skyline of New York City"

Concert (Ours)

"A photo of a rock concert"

Anime (Ours)

"Natural landscape in anime style illustration"

Grass (Ours)

"A photo of a grassland with animals"

Lalaland (Ours)

"Silhouette wallpaper of a dreamy scene with shooting stars"

Lake (Ours)

"A photo of a lake under the northern lights"

Mountain (Ours)

"A photo of a mountain range at twilight"

Castle (Ours)

"A cinematic view of a castle in the sunset"

Waterfall (Ours)

"A waterfall"

Alley (Ours)

"A bird's eye view of an alley with shops"

Vines (Ours)

"A photo of vines on a brick wall"

Railway (Ours)

"A top view of a single railway"





Qualitative Comparisons


SyncDiffusion generates more globally coherent panoramas than the state-of-the-art MultiDiffusion [Bar-Tal et al. 2023].

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“A photo of a rock concert”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“Skyline of New York City”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“Natural landscape in anime style illustration”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“Silhouette wallpaper of a dreamy scene with shooting stars”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“A cinematic view of a castle in the sunset”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“Cartoon panorama of spring summer beautiful nature”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“A photo of a city skyline at night”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“A photo of a beautiful ocean with coral reef”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“A photo of a grassland with animals”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“A photo of a lake under the northern lights”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“An illustration of a beach in La La Land style”

Ours
MultiDiffusion [Bar-Tal et al. 2023]

“A photo of a mountain range at twilight”



Plug-and-Play Applications


BibTeX

@inproceedings{lee2023syncdiffusion,
  title={SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions},
  author={Yuseung Lee and Kunho Kim and Hyunjin Kim and Minhyuk Sung},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
}

Acknowledgements

We thank Juil Koo for valuable discussions on diffusion models and Eunji Hong for help in conducting user studies. This work was partially supported by the NRF grant (RS2023-00209723) and IITP grants (2019-0-00075, 2022-0-00594, RS-2023-00227592) funded by the Korean government (MSIT), the Technology Innovation Program (20016615) funded by the Korean government (MOTIE), grants from ETRI, KT, NCSOFT, and Samsung Electronics, and computing resource support from KISTI.