[Paper Review] High-Resolution Image Synthesis with Latent Diffusion Models (LDM, Latent Diffusion)

High-Resolution Image Synthesis with Latent Diffusion Models

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism t

arxiv.org

The idea is simple. The diffusion process that used to run over the full domain of the input image now runs in a latent space. That's it!

Introduction

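The paper starts from the very large computational demand of high-resolution image synthesis (in the paper this mostly means generation tasks). Approaches such as autoregressive models and GANs have been proposed, but the one drawing the most attention recently is diffusion models (DMs).

A DM is a likelihood-based model that estimates the data distribution through the ELBO, so it suffers less from problems such as mode collapse, which arise in GANs that model the distribution only implicitly, and it can work with fewer parameters than autoregressive models. Because it models the whole data distribution (mode-covering), it can generate high-quality images, but because training accounts for every detail across the full image domain, the computational cost becomes very large.

The core idea of the paper follows from this: discard the unnecessary detail in the input image, compress it into a good latent, and run diffusion there to cut the computational cost.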

Departure to Latent Space

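The paper phrases this as finding a space that is 'perceptually equivalent' to the space of the original images and doing the computation there.

The figure above shows the rate-distortion curve of a trained DM. Training of a DM can be split into two stages: first a perceptual compression stage that removes the high-frequency detail of the image, then a semantic compression stage that learns the overall composition of the image. As the plot shows, most of the bits of an image encode imperceptible detail (detail that has little to do with the overall meaning of the image). A DM learns all of this unnecessary detail too, which is why it spends so much computation needlessly.

So if an autoencoder is used to find a perceptually equivalent latent space (that is, one holding only a hidden representation that is semantically the same as the original image) and the DM runs there, diffusion can be carried out efficiently. The paper names these Latent Diffusion Models (LDMs).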
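To get a feel for the savings, here is a rough sketch of how much smaller the latent is than the pixel grid. The image size, the downsampling factor `f`, and the latent channel count `c` are illustrative assumptions, not values stated in this review, and `latent_shape` is a hypothetical helper.

```python
def latent_shape(height, width, f, c):
    """Shape of the latent z for a height x width image, assuming the
    encoder downsamples each spatial side by a factor f and outputs c channels."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by f")
    return (height // f, width // f, c)


# Assumed example values (hypothetical, for illustration only).
H, W = 256, 256   # RGB input resolution
f, c = 8, 4       # downsampling factor and latent channels

h, w, _ = latent_shape(H, W, f, c)
pixel_elems = H * W * 3     # values the DM would process in pixel space
latent_elems = h * w * c    # values the DM processes in latent space

print(latent_shape(H, W, f, c))    # (32, 32, 4)
print(pixel_elems / latent_elems)  # 48.0
```

Under these assumed numbers the denoiser sees 48 times fewer values per sample, which is where the reduction in diffusion cost comes from.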

Method

Perceptual Image Compression

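This is the step that trains the autoencoder that maps images into the latent space. The paper uses two variants, VAE and VQ-VAE, and the regularization term in the loss function differs slightly depending on which one is used. Adversarial training is applied during training so that the autoencoder acts as the generator of a GAN. The training details are covered more fully in the authors' previous work, Taming Transformers.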

Taming Transformers for High-Resolution Image Synthesis

Abstract Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes th

compvis.github.io

Latent Diffusion Models

The training objective is based on DDPM.

Denoising Diffusion Probabilistic Models

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound

arxiv.org

Comparing the DDPM objective with the LDM objective, the only differences are that the expectation is taken over the encoding Ɛ(x) instead of x, and that the latent z_t is used in place of x_t. ϵ_θ is the denoising model.
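For reference, the two objectives can be written out as below. This is a transcription in LaTeX assuming the simplified (unweighted) noise-prediction form, with \mathcal{E} standing for the encoder Ɛ.

```latex
% DDPM objective in pixel space
L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}
  \left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \right]

% LDM objective in latent space
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}
  \left[ \lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2 \right]
```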

Conditioning Mechanisms

To support various kinds of conditional modeling, a conditioning variable y is added and the objective is rewritten as follows.
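Written out in LaTeX, assuming the same noise-prediction form as the unconditional objective, with the encoded condition \tau_\theta(y) passed to the denoiser as an extra input:

```latex
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}
  \left[ \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \right]
```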

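Here τ is a network that encodes the conditioning variable y; the paper uses a Transformer. To map y onto z well, a query is taken from the intermediate features of each layer of the denoising model (a U-Net in the paper), the key and value are taken from the encoded condition, and cross attention is applied. (This is the odd-looking "QKV QKV" part of the overview figure.)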
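The cross-attention step described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: the shapes, the random projection matrices `W_q`, `W_k`, `W_v`, and the function names are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(feat, cond, W_q, W_k, W_v):
    """Single-head cross attention.

    feat: (N, d_feat) flattened intermediate U-Net features (source of queries)
    cond: (M, d_cond) encoded condition tau(y) (source of keys and values)
    """
    Q = feat @ W_q                                  # (N, d)
    K = cond @ W_k                                  # (M, d)
    V = cond @ W_v                                  # (M, d)
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (N, M), each row sums to 1
    return attn @ V, attn


# Hypothetical sizes: 16 spatial positions, 5 condition tokens.
rng = np.random.default_rng(0)
N, M = 16, 5
d_feat, d_cond, d = 32, 24, 8

feat = rng.normal(size=(N, d_feat))
cond = rng.normal(size=(M, d_cond))
W_q = rng.normal(size=(d_feat, d))
W_k = rng.normal(size=(d_cond, d))
W_v = rng.normal(size=(d_cond, d))

out, attn = cross_attention(feat, cond, W_q, W_k, W_v)
print(out.shape, attn.shape)  # (16, 8) (16, 5)
```

Each spatial position of the feature map gets its own weighting over the condition tokens, which is how the condition can influence different regions of the latent differently.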
