Vector-Quantized Image Modeling with Improved VQGAN

Motivated by the success of next-token pretraining in language modeling, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. In "Vector-Quantized Image Modeling with Improved VQGAN", we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks.

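Paper title: "Vector-Quantized Image Modeling with Improved VQGAN" (ICLR 2022). Authors: Jiahui Yu et al., Google Research. The paper shows that a model like VQGAN is useful not only for image generation: its pretrained model can also be fine-tuned and transferred to tasks such as image classification.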

We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding. We first propose multiple improvements over vanilla VQGAN, from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning.

The Vector-Quantized (VQ) codebook was first introduced in VQVAE, which aims to learn discrete priors to encode images. The follow-up work, VQGAN, proposes a perceptual codebook by additionally using perceptual loss and adversarial training objectives.
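To make the codebook idea concrete, here is a minimal sketch of the lookup step only: each encoder output vector is replaced by its nearest codebook entry, and the index of that entry becomes the discrete token. The function names, the toy 2-D codebook, and the use of squared Euclidean distance are assumptions chosen for illustration; this is not the VQVAE or VQGAN implementation, and it omits the training losses entirely.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: List[List[float]], codebook: List[List[float]]
) -> Tuple[List[int], List[List[float]]]:
    """Return, for each latent, the index and value of its nearest code."""
    indices = []
    quantized = []
    for z in latents:
        # Nearest-neighbour search over all codebook entries.
        idx = min(range(len(codebook)),
                  key=lambda k: squared_distance(z, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy data (hypothetical): 3 codes and 3 encoder outputs, both 2-D.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.2]]

indices, quantized = quantize(latents, codebook)
print(indices)  # discrete tokens, one per latent vector
```

The list of indices is what a second-stage model would consume as image tokens; the quantized vectors are what a decoder would reconstruct the image from.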

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks.

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis. Vector-quantized (VQ-based) generative models usually consist of two basic components: VQ tokenizers and generative transformers. Prior research focuses on improving the reconstruction fidelity of VQ tokenizers but rarely examines how the improvement in reconstruction affects generation.

Vision transformers (ViTs) have gained popularity recently. Even without customized image operators such as convolutions, ViTs can yield competitive performance when properly trained on massive data. However, the computational overhead of ViTs remains prohibitive, due to stacked multi-head self-attention modules, among other components. Compared to the vast literature and prevailing success in compressing CNNs, research on ViT compression remains limited.

The first step is to encode an image into lower-dimensional discrete latent codes using an image quantization model called VQGAN.

Image encoders compress an image into a lower-dimensional representation, sometimes quantized into a discrete space (such as the VQGAN from taming-transformers used in Craiyon). In this article, we try to reproduce the results from ViT-VQGAN ("Vector-quantized Image Modeling with Improved VQGAN") and experiment with further adaptations.

A related resource is a vector quantization library, originally transcribed from DeepMind's TensorFlow implementation and made into a convenient package. It uses exponential moving averages to update the dictionary. VQ has been used successfully by DeepMind and OpenAI for high-quality generation of images (VQ-VAE-2) and music (Jukebox).
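The exponential-moving-average dictionary update can be sketched as follows. This is a simplified illustration under stated assumptions, not the API or code of any particular library: the function name `ema_update`, the state variables `cluster_size` and `embed_sum`, and the decay value are hypothetical, and refinements such as smoothing of the counts are left out. Each code keeps a running count of assigned vectors and a running sum of those vectors; the code itself is their ratio.

```python
from typing import List


def ema_update(
    codebook: List[List[float]],
    cluster_size: List[float],
    embed_sum: List[List[float]],
    latents: List[List[float]],
    indices: List[int],
    decay: float = 0.99,
) -> None:
    """Update the codebook in place from one batch of assignments."""
    dim = len(codebook[0])
    for k in range(len(codebook)):
        # Latents assigned to code k in this batch.
        assigned = [z for z, i in zip(latents, indices) if i == k]
        count = len(assigned)
        total = [sum(z[d] for z in assigned) for d in range(dim)]
        # Running averages of the assignment count and the vector sum.
        cluster_size[k] = decay * cluster_size[k] + (1 - decay) * count
        embed_sum[k] = [decay * m + (1 - decay) * t
                        for m, t in zip(embed_sum[k], total)]
        # The code becomes the running mean of its assigned vectors.
        if cluster_size[k] > 0:
            codebook[k] = [m / cluster_size[k] for m in embed_sum[k]]


# Toy example (hypothetical values); decay=0.5 keeps the arithmetic easy.
codebook = [[0.0, 0.0], [1.0, 1.0]]
cluster_size = [1.0, 1.0]
embed_sum = [list(code) for code in codebook]

ema_update(codebook, cluster_size, embed_sum,
           latents=[[2.0, 2.0]], indices=[1], decay=0.5)
print(codebook)  # code 1 moves halfway toward the latent [2.0, 2.0]
```

With this kind of update the codebook needs no gradient of its own: the codes drift toward the encoder outputs assigned to them, at a rate set by the decay.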

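ViT-VQGAN is an improved version of VQ-GAN. Readers who are not familiar with it can watch The AI Epiphany's introductions to VQ-GAN and VQ-VAE. The main goal of this family of methods is to obtain a good quantizer; VQ-VAE uses a CNN-based auto-encoder to turn the latent space into a dictionary-like codebook (discrete…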

We propose Vector-quantized Image Modeling (VIM), which pretrains a Transformer to predict image tokens autoregressively, where discrete image tokens are produced from improved ViT-VQGAN image quantizers. With our proposed improvements on image quantization, we demonstrate superior results on both image generation and understanding. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN).
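As a rough illustration of the data layout behind autoregressive prediction of image tokens, the sketch below flattens a 2-D grid of token ids in raster order (row by row) and builds the shifted input/target pair used for next-token prediction. The helper names, the tiny 2x2 grid, and the start-of-sequence id `bos_id` are hypothetical choices for this example; the Transformer itself and its training loop are not shown.

```python
from typing import List, Tuple


def rasterize(grid: List[List[int]]) -> List[int]:
    """Flatten a 2-D token grid row by row (raster order)."""
    return [token for row in grid for token in row]


def next_token_pairs(tokens: List[int], bos_id: int) -> Tuple[List[int], List[int]]:
    """Build inputs and targets so that position t predicts token t.

    The input at position t contains only tokens before t (plus a start
    token), which is what makes the prediction autoregressive.
    """
    inputs = [bos_id] + tokens[:-1]
    targets = list(tokens)
    return inputs, targets


# Hypothetical 2x2 grid of codebook indices from an image quantizer.
grid = [[5, 2],
        [7, 1]]
tokens = rasterize(grid)
inputs, targets = next_token_pairs(tokens, bos_id=8192)

for seen, predicted in zip(inputs, targets):
    print(f"input {seen:>4} -> target {predicted}")
```

In a real model, a causal attention mask would ensure that the prediction at each position depends only on earlier positions in this sequence.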

We briefly describe the VQGAN model with its codebook in this section, and more details can be found in the original paper.

Vector-quantized Image Modeling with Improved VQGAN. Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu. ICLR 2022.

BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei. arXiv 2022.

VQ-Diffusion. Vector Quantized Diffusion (VQ-Diffusion) is a conditional latent diffusion model developed by the University of Science and Technology of China and Microsoft.