DecoderTrainer sample method uses the exponentially moving averaged

allow for division of loss prior to scaling, for gradient accumulation purposes
take care of mixed precision, and make gradient accumulation do-able externally
2026-02-12 19:44:26 +01:00 · 2022-04-30 14:55:34 -07:00 · 2022-04-30 12:56:47 -07:00 · 2022-04-30 12:27:24 -07:00 · 2022-04-30 11:56:05 -07:00 · 2022-04-30 11:45:18 -07:00
8 changed files with 2258 additions and 369 deletions
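The commit message above mentions dividing the loss before scaling so that gradient accumulation can be done externally. A minimal sketch of why that division matters is below, using plain Python numbers in place of tensors. The helper `grad_mse` and the toy data are hypothetical and are not part of the repository.

```python
# Sketch: dividing each micro-batch loss by `divisor` reproduces the full-batch gradient.
# `grad_mse` and the data below are hypothetical stand-ins for the real tensor math.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# gradient from one large batch
full_grad = grad_mse(w, xs, ys)

# gradient accumulated over two equally sized micro-batches
divisor = 2
accumulated_grad = 0.0
for start in range(0, len(xs), 2):
    micro_grad = grad_mse(w, xs[start:start + 2], ys[start:start + 2])
    accumulated_grad += micro_grad / divisor   # mirrors `loss / divisor` before backward

assert abs(full_grad - accumulated_grad) < 1e-12
```

With the trainer in this commit, the same idea would presumably look like calling `decoder_trainer(images, unet_number = 1, divisor = 2)` followed by `loss.backward()` on each micro-batch, and calling `decoder_trainer.update(1)` only once both micro-batches have been processed.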
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 <img src="./dalle2.png" width="450px"></img>

-## DALL-E 2 - Pytorch (wip)
+## DALL-E 2 - Pytorch

 Implementation of <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, OpenAI's updated text-to-image synthesis neural network, in Pytorch.

@@ -10,11 +10,9 @@ The main novelty seems to be an extra layer of indirection with the prior networ

 This model is SOTA for text-to-image for now.

-It may also explore an extension of using <a href="https://huggingface.co/spaces/multimodalart/latentdiffusion">latent diffusion</a> in the decoder from Rombach et al.
-
 Please join <a href="https://discord.gg/xBPBXfcFHd"><img alt="Join us on Discord" src="https://img.shields.io/discord/823813159592001537?color=5865F2&logo=discord&logoColor=white"></a> if you are interested in helping out with the replication

-There was enough interest for a Jax version. It will be completed after the Pytorch version shows signs of life on my toy tasks. <a href="https://github.com/lucidrains/dalle2-jax">Placeholder repository</a>
+There was enough interest for a <a href="https://github.com/lucidrains/dalle2-jax">Jax version</a>. I will also eventually extend this to <a href="https://github.com/lucidrains/dalle2-video">text to video</a>, once the repository is in a good place.

 ## Install

@@ -49,7 +47,7 @@ clip = CLIP(
    use_all_token_embeds = True,            # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
-    use_visual_ssl = True,                  # whether to do self supervised learning on iages
+    use_visual_ssl = True,                  # whether to do self supervised learning on images
    visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
    use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05,            # weight for text MLM loss
@@ -112,7 +110,8 @@ decoder = Decoder(
    unet = unet,
    clip = clip,
    timesteps = 100,
-    cond_drop_prob = 0.2
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5
 ).cuda()

 # mock images (get a lot of this)
@@ -197,10 +196,10 @@ clip = CLIP(
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
-    text_enc_depth = 1,
+    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
-    visual_enc_depth = 1,
+    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
@@ -209,29 +208,30 @@ clip = CLIP(
 # 2 unets for the decoder (a la cascading DDPM)

 unet1 = Unet(
-    dim = 16,
+    dim = 32,
    image_embed_dim = 512,
+    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8)
 ).cuda()

 unet2 = Unet(
-    dim = 16,
+    dim = 32,
    image_embed_dim = 512,
-    lowres_cond = True,         # subsequent unets must have this turned on (and first unet must have this turned off)
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8, 16)
 ).cuda()

-# decoder, which contains the unet and clip
+# decoder, which contains the unet(s) and clip

 decoder = Decoder(
    clip = clip,
    unet = (unet1, unet2),            # insert both unets in order of low resolution to highest resolution (you can have as many stages as you want here)
-    image_sizes = (256, 512),         # resolutions, 256 for first unet, 512 for second
-    timesteps = 100,
-    cond_drop_prob = 0.2
+    image_sizes = (256, 512),         # resolutions, 256 for first unet, 512 for second. these must be unique and in ascending order (matches with the unets passed in)
+    timesteps = 1000,
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5
 ).cuda()

 # mock images (get a lot of this)
@@ -248,16 +248,9 @@ loss = decoder(images, unet_number = 2)
 loss.backward()

 # do the above for many steps for both unets
-
-# then it will learn to generate images based on the CLIP image embeddings
-
-# chaining the unets from lowest resolution to highest resolution (thus cascading)
-
-mock_image_embed = torch.randn(1, 512).cuda()
-images = decoder.sample(mock_image_embed) # (1, 3, 512, 512)
 ```

-Finally, to generate the DALL-E2 images from text. Insert the trained `DiffusionPrior` as well as the `Decoder` (which both contains `CLIP`, a unet, and a causal transformer)
+Finally, to generate the DALL-E2 images from text, insert the trained `DiffusionPrior` as well as the `Decoder` (which wraps `CLIP`, the causal transformer, and unet(s))

 ```python
 from dalle2_pytorch import DALLE2
@@ -349,8 +342,7 @@ unet2 = Unet(
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
-    dim_mults = (1, 2, 4, 8, 16),
-    lowres_cond = True
+    dim_mults = (1, 2, 4, 8, 16)
 ).cuda()

 decoder = Decoder(
@@ -358,7 +350,9 @@ decoder = Decoder(
    image_sizes = (128, 256),
    clip = clip,
    timesteps = 100,
-    cond_drop_prob = 0.2
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5,
+    condition_on_text_encodings = False  # set this to True if you wish to condition on text during training and sampling
 ).cuda()

 for unet_number in (1, 2):
@@ -386,7 +380,413 @@ You can also train the decoder on images of greater than the size (say 512x512)

 For the layperson, no worries, training will all be automated into a CLI tool, at least for small scale training.

-## CLI Usage (work in progress)
+## Training on Preprocessed CLIP Embeddings
+
+It is likely, when scaling up, that you would first preprocess your images and text into corresponding embeddings before training the prior network. You can do so easily by simply passing in `image_embed`, `text_embed`, and optionally `text_encodings` and `text_mask`
+
+Working example below
+
+```python
+import torch
+from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior, CLIP
+
+# get trained CLIP from step one
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 6,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 6,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8,
+).cuda()
+
+# setup prior network, which contains an autoregressive transformer
+
+prior_network = DiffusionPriorNetwork(
+    dim = 512,
+    depth = 6,
+    dim_head = 64,
+    heads = 8
+).cuda()
+
+# diffusion prior network, which contains the CLIP and network (with transformer) above
+
+diffusion_prior = DiffusionPrior(
+    net = prior_network,
+    clip = clip,
+    timesteps = 100,
+    cond_drop_prob = 0.2,
+    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
+).cuda()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# precompute the text and image embeddings
+# here using the diffusion prior class, but could be done with CLIP alone
+
+clip_image_embeds = diffusion_prior.clip.embed_image(images).image_embed
+clip_text_embeds = diffusion_prior.clip.embed_text(text).text_embed
+
+# feed text and images into diffusion prior network
+
+loss = diffusion_prior(
+    text_embed = clip_text_embeds,
+    image_embed = clip_image_embeds
+)
+
+loss.backward()
+
+# do the above for many many many steps
+# now the diffusion prior can generate image embeddings from the text embeddings
+```
+
+You can also completely go `CLIP`-less, in which case you will need to pass in the `image_embed_dim` into the `DiffusionPrior` on initialization
+
+```python
+import torch
+from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior
+
+# setup prior network, which contains an autoregressive transformer
+
+prior_network = DiffusionPriorNetwork(
+    dim = 512,
+    depth = 6,
+    dim_head = 64,
+    heads = 8
+).cuda()
+
+# diffusion prior network, which contains only the network (with transformer) above, since no CLIP is passed in
+
+diffusion_prior = DiffusionPrior(
+    net = prior_network,
+    image_embed_dim = 512,               # this needs to be set
+    timesteps = 100,
+    cond_drop_prob = 0.2,
+    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
+).cuda()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# mock precomputed text and image embeddings
+# in practice these would come from whichever CLIP you used for preprocessing
+
+clip_image_embeds = torch.randn(4, 512).cuda()
+clip_text_embeds = torch.randn(4, 512).cuda()
+
+# feed text and images into diffusion prior network
+
+loss = diffusion_prior(
+    text_embed = clip_text_embeds,
+    image_embed = clip_image_embeds
+)
+
+loss.backward()
+
+# do the above for many many many steps
+# now the diffusion prior can generate image embeddings from the text embeddings
+```
+
+## OpenAI CLIP
+
+Although there is the possibility they are using an unreleased, more powerful CLIP, you can use one of the released ones, if you do not wish to train your own CLIP from scratch. This will also allow the community to more quickly validate the conclusions of the paper.
+
+To use a pretrained OpenAI CLIP, simply import `OpenAIClipAdapter` and pass it into the `DiffusionPrior` or `Decoder` like so
+
+```python
+import torch
+from dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder, OpenAIClipAdapter
+
+# openai pretrained clip - defaults to ViT-B/32
+
+clip = OpenAIClipAdapter()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# prior networks (with transformer)
+
+prior_network = DiffusionPriorNetwork(
+    dim = 512,
+    depth = 6,
+    dim_head = 64,
+    heads = 8
+).cuda()
+
+diffusion_prior = DiffusionPrior(
+    net = prior_network,
+    clip = clip,
+    timesteps = 100,
+    cond_drop_prob = 0.2
+).cuda()
+
+loss = diffusion_prior(text, images)
+loss.backward()
+
+# do above for many steps ...
+
+# decoder (with unet)
+
+unet1 = Unet(
+    dim = 128,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults=(1, 2, 4, 8)
+).cuda()
+
+unet2 = Unet(
+    dim = 16,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16)
+).cuda()
+
+decoder = Decoder(
+    unet = (unet1, unet2),
+    image_sizes = (128, 256),
+    clip = clip,
+    timesteps = 100,
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5,
+    condition_on_text_encodings = False  # set this to True if you wish to condition on text during training and sampling
+).cuda()
+
+for unet_number in (1, 2):
+    loss = decoder(images, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
+    loss.backward()
+
+# do above for many steps
+
+dalle2 = DALLE2(
+    prior = diffusion_prior,
+    decoder = decoder
+)
+
+images = dalle2(
+    ['a butterfly trying to escape a tornado'],
+    cond_scale = 2. # classifier free guidance strength (> 1 would strengthen the condition)
+)
+
+# save your image (in this example, of size 256x256)
+```
+
+Now you'll just have to worry about training the Prior and the Decoder!
+
+## Experimental
+
+### DALL-E2 with Latent Diffusion
+
+This repository decides to take the next step and offer DALL-E v2 combined with <a href="https://huggingface.co/spaces/multimodalart/latentdiffusion">latent diffusion</a>, from Rombach et al.
+
+You can use it as follows. Latent diffusion can be limited to just the first U-Net in the cascade, or to any number you wish.
+
+The repository also comes equipped with all the necessary settings to recreate `ViT-VQGan` from the <a href="https://arxiv.org/abs/2110.04627">Improved VQGans</a> paper. Furthermore, the <a href="https://github.com/lucidrains/vector-quantize-pytorch">vector quantization</a> library also comes equipped to do <a href="https://arxiv.org/abs/2203.01941">residual or multi-headed quantization</a>, which I believe will give an even further boost in performance to the autoencoder.
+
+```python
+import torch
+from dalle2_pytorch import Unet, Decoder, CLIP, VQGanVAE
+
+# trained clip from step 1
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 1,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 1,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8
+)
+
+# 3 unets for the decoder (a la cascading DDPM)
+
+# first two unets are doing latent diffusion
+# vqgan-vae must be trained beforehand
+
+vae1 = VQGanVAE(
+    dim = 32,
+    image_size = 256,
+    layers = 3,
+    layer_mults = (1, 2, 4)
+)
+
+vae2 = VQGanVAE(
+    dim = 32,
+    image_size = 512,
+    layers = 3,
+    layer_mults = (1, 2, 4)
+)
+
+unet1 = Unet(
+    dim = 32,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    sparse_attn = True,
+    sparse_attn_window = 2,
+    dim_mults = (1, 2, 4, 8)
+)
+
+unet2 = Unet(
+    dim = 32,
+    image_embed_dim = 512,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16),
+    cond_on_image_embeds = True,
+    cond_on_text_encodings = False
+)
+
+unet3 = Unet(
+    dim = 32,
+    image_embed_dim = 512,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16),
+    cond_on_image_embeds = True,
+    cond_on_text_encodings = False,
+    attend_at_middle = False
+)
+
+# decoder, which contains the unet(s) and clip
+
+decoder = Decoder(
+    clip = clip,
+    vae = (vae1, vae2),                # latent diffusion for unet1 (vae1) and unet2 (vae2), but not for the last unet3
+    unet = (unet1, unet2, unet3),      # insert unets in order of low resolution to highest resolution (you can have as many stages as you want here)
+    image_sizes = (256, 512, 1024),    # resolutions, 256 for first unet, 512 for second, 1024 for third
+    timesteps = 100,
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5
+).cuda()
+
+# mock images (get a lot of this)
+
+images = torch.randn(1, 3, 1024, 1024).cuda()
+
+# feed images into decoder, specifying which unet you want to train
+# each unet can be trained separately, which is one of the benefits of the cascading DDPM scheme
+
+with decoder.one_unet_in_gpu(1):
+    loss = decoder(images, unet_number = 1)
+    loss.backward()
+
+with decoder.one_unet_in_gpu(2):
+    loss = decoder(images, unet_number = 2)
+    loss.backward()
+
+with decoder.one_unet_in_gpu(3):
+    loss = decoder(images, unet_number = 3)
+    loss.backward()
+
+# do the above for many steps for all three unets
+
+# then it will learn to generate images based on the CLIP image embeddings
+
+# chaining the unets from lowest resolution to highest resolution (thus cascading)
+
+mock_image_embed = torch.randn(1, 512).cuda()
+images = decoder.sample(mock_image_embed) # (1, 3, 1024, 1024)
+```
+
+## Training wrapper (wip)
+
+### Decoder Training
+
+Training the `Decoder` may be confusing, as one needs to keep track of an optimizer for each of the `Unet`(s) separately. Each `Unet` will also need its own corresponding exponential moving average. The `DecoderTrainer` hopes to make this simple, as shown below
+
+```python
+import torch
+from dalle2_pytorch import DALLE2, Unet, Decoder, CLIP, DecoderTrainer
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 6,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 6,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8
+).cuda()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# decoder (with unet)
+
+unet1 = Unet(
+    dim = 128,
+    image_embed_dim = 512,
+    text_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults=(1, 2, 4, 8)
+).cuda()
+
+unet2 = Unet(
+    dim = 16,
+    image_embed_dim = 512,
+    text_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16),
+    cond_on_text_encodings = True
+).cuda()
+
+decoder = Decoder(
+    unet = (unet1, unet2),
+    image_sizes = (128, 256),
+    clip = clip,
+    timesteps = 1000,
+    condition_on_text_encodings = True
+).cuda()
+
+decoder_trainer = DecoderTrainer(
+    decoder,
+    lr = 3e-4,
+    wd = 1e-2,
+    ema_beta = 0.99,
+    ema_update_after_step = 1000,
+    ema_update_every = 10,
+)
+
+for unet_number in (1, 2):
+    loss = decoder_trainer(images, text = text, unet_number = unet_number)  # use the decoder_trainer forward
+    loss.backward()
+
+    decoder_trainer.update(unet_number) # update the specific unet as well as its exponential moving average
+
+# after much training
+# you can sample from the exponentially moving averaged unets as so
+
+mock_image_embed = torch.randn(4, 512).cuda()
+images = decoder_trainer.sample(mock_image_embed, text = text) # (4, 3, 256, 256)
+```
+
+## CLI (wip)

 ```bash
 $ dream 'sharing a sunset at the summit of mount everest with my dog'
@@ -394,9 +794,7 @@ $ dream 'sharing a sunset at the summit of mount everest with my dog'

 Once built, images will be saved to the same directory the command is invoked

-## Training wrapper (wip)
-
-Offer training wrappers
+<a href="https://github.com/lucidrains/big-sleep">template</a>

 ## Training CLI (wip)

@@ -410,14 +808,24 @@ Offer training wrappers
 - [x] augment unet so that it can also be conditioned on text encodings (although in paper they hinted this didn't make much a difference)
 - [x] figure out all the current bag of tricks needed to make DDPMs great (starting with the blur trick mentioned in paper)
 - [x] build the cascading ddpm by having Decoder class manage multiple unets at different resolutions
- [ ] use an image resolution cutoff and do cross attention conditioning only if resources allow, and MLP + sum conditioning on rest
- [ ] make unet more configurable
- [ ] figure out some factory methods to make cascading unet instantiations less error-prone
- [ ] offload unets not being trained on to CPU for memory efficiency (for training each resolution unets separately)
+- [x] add efficient attention in unet
+- [x] be able to finely customize what to condition on (text, image embed) for specific unet in the cascade (super resolution ddpms near the end may not need too much conditioning)
+- [x] offload unets not being trained on to CPU for memory efficiency (for training each resolution unets separately)
+- [x] build out latent diffusion architecture, with the vq-reg variant (vqgan-vae), make it completely optional and compatible with cascading ddpms
+- [x] for decoder, allow ability to customize objective (predict epsilon vs x0), in case latent diffusion does better with prediction of x0
+- [x] use attention-based upsampling https://arxiv.org/abs/2112.11435
+- [x] use inheritance just this once for sharing logic between decoder and prior network ddpms
+- [x] bring in vit-vqgan https://arxiv.org/abs/2110.04627 for the latent diffusion
+- [x] abstract interface for CLIP adapter class, so other CLIPs can be brought in
+- [x] take care of mixed precision as well as gradient accumulation within decoder trainer
+- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet
+- [ ] copy the cascading ddpm code to a separate repo (perhaps https://github.com/lucidrains/denoising-diffusion-pytorch) as the main contribution of dalle2 really is just the prior network
+- [ ] transcribe code to Jax, which lowers the activation energy for distributed training, given access to TPUs
+- [ ] just take care of the training for the decoder in a wrapper class, as each unet in the cascade will need its own optimizer
 - [ ] train on a toy task, offer in colab
- [ ] add attention to unet - apply some personal tricks with efficient attention - use the sparse attention mechanism from https://github.com/lucidrains/vit-pytorch#maxvit
- [ ] build out latent diffusion architecture in separate file, as it is not faithful to dalle-2 (but offer it as as setting)
- [ ] consider U2-net for decoder https://arxiv.org/abs/2005.09007 (also in separate file as experimental) build out https://github.com/lucidrains/x-unet
+- [ ] think about how best to design a declarative training config that handles preencoding for prior and training of multiple networks in decoder
+- [ ] extend diffusion head to use diffusion-gan (potentially using lightweight-gan) to speed up inference
+- [ ] bring in tools to train vqgan-vae

 ## Citations

@@ -449,20 +857,27 @@ Offer training wrappers

 ```bibtex
@inproceedings{Liu2022ACF,
-    title   = {A ConvNet for the 2020s},
+    title   = {A ConvNet for the 2020s},
    author  = {Zhuang Liu and Hanzi Mao and Chaozheng Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
    year    = {2022}
 }
 ```

 ```bibtex
-@misc{zhang2019root,
-    title   = {Root Mean Square Layer Normalization},
-    author  = {Biao Zhang and Rico Sennrich},
-    year    = {2019},
-    eprint  = {1910.07467},
-    archivePrefix = {arXiv},
-    primaryClass = {cs.LG}
+@inproceedings{Tu2022MaxViTMV,
+    title   = {MaxViT: Multi-Axis Vision Transformer},
+    author  = {Zhe-Wei Tu and Hossein Talebi and Han Zhang and Feng Yang and Peyman Milanfar and Alan Conrad Bovik and Yinxiao Li},
+    year    = {2022}
+}
+```
+
+```bibtex
+@article{Yu2021VectorquantizedIM,
+    title   = {Vector-quantized Image Modeling with Improved VQGAN},
+    author  = {Jiahui Yu and Xin Li and Jing Yu Koh and Han Zhang and Ruoming Pang and James Qin and Alexander Ku and Yuanzhong Xu and Jason Baldridge and Yonghui Wu},
+    journal = {ArXiv},
+    year    = {2021},
+    volume  = {abs/2110.04627}
 }
 ```

--- a/dalle2_pytorch/__init__.py
+++ b/dalle2_pytorch/__init__.py
@@ -1,2 +1,6 @@
 from dalle2_pytorch.dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder
+from dalle2_pytorch.dalle2_pytorch import OpenAIClipAdapter
+from dalle2_pytorch.train import DecoderTrainer
+
+from dalle2_pytorch.vqgan_vae import VQGanVAE
 from x_clip import CLIP
--- a/dalle2_pytorch/cli.py
+++ b/dalle2_pytorch/cli.py
@@ -1,9 +1,52 @@
 import click
+import torch
+import torchvision.transforms as T
+from functools import reduce
+from pathlib import Path
+
+from dalle2_pytorch import DALLE2, Decoder, DiffusionPrior
+
+def safeget(dictionary, keys, default = None):
+    return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split('.'), dictionary)
+
+def simple_slugify(text, max_length = 255):
+    return text.replace("-", "_").replace(",", "").replace(" ", "_").replace("|", "--").strip('-_')[:max_length]
+
+def get_pkg_version():
+    from pkg_resources import get_distribution
+    return get_distribution('dalle2_pytorch').version

 def main():
    pass

@click.command()
+@click.option('--model', default = './dalle2.pt', help = 'path to trained DALL-E2 model')
+@click.option('--cond_scale', default = 2, help = 'conditioning scale (classifier free guidance) in decoder')
@click.argument('text')
-def dream(text):
-    return image
+def dream(
+    model,
+    cond_scale,
+    text
+):
+    model_path = Path(model)
+    full_model_path = str(model_path.resolve())
+    assert model_path.exists(), f'model not found at {full_model_path}'
+    loaded = torch.load(str(model_path))
+
+    version = safeget(loaded, 'version')
+    print(f'loading DALL-E2 from {full_model_path}, saved at version {version} - current package version is {get_pkg_version()}')
+
+    prior_init_params = safeget(loaded, 'init_params.prior')
+    decoder_init_params = safeget(loaded, 'init_params.decoder')
+    model_params = safeget(loaded, 'model_params')
+
+    prior = DiffusionPrior(**prior_init_params)
+    decoder = Decoder(**decoder_init_params)
+
+    dalle2 = DALLE2(prior, decoder)
+    dalle2.load_state_dict(model_params)
+
+    image = dalle2(text, cond_scale = cond_scale)
+
+    pil_image = T.ToPILImage()(image)
+    return pil_image.save(f'./{simple_slugify(text)}.png')
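The two helpers added to `cli.py` are easy to try in isolation. The sketch below copies them as they appear in the diff and runs them against a hypothetical checkpoint dictionary. The real layout of a saved `dalle2.pt` is not shown in this commit, so the `loaded` structure here is only an assumption based on the keys that `dream` reads.

```python
from functools import reduce

def safeget(dictionary, keys, default = None):
    # walk a dotted key path, returning `default` once a level is missing or is not a dict
    return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split('.'), dictionary)

def simple_slugify(text, max_length = 255):
    # turn a prompt into a filesystem-friendly name
    return text.replace("-", "_").replace(",", "").replace(" ", "_").replace("|", "--").strip('-_')[:max_length]

# hypothetical checkpoint layout, shaped like what `dream` reads after torch.load
loaded = {
    'version': '0.0.1',
    'init_params': {'prior': {'timesteps': 100}},
}

version = safeget(loaded, 'version')
prior_params = safeget(loaded, 'init_params.prior')
decoder_params = safeget(loaded, 'init_params.decoder')   # missing key falls back to None
filename = simple_slugify('a butterfly trying to escape a tornado') + '.png'
```

Because a missing key silently yields `None`, a call such as `Decoder(**decoder_init_params)` would raise a `TypeError` on an incomplete checkpoint, so an explicit check before constructing the models may be worth considering.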
--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
--- a/dalle2_pytorch/optimizer.py
+++ b/dalle2_pytorch/optimizer.py
@@ -0,0 +1,29 @@
+from torch.optim import AdamW, Adam
+
+def separate_weight_decayable_params(params):
+    no_wd_params = set([param for param in params if param.ndim < 2])
+    wd_params = set(params) - no_wd_params
+    return wd_params, no_wd_params
+
+def get_optimizer(
+    params,
+    lr = 3e-4,
+    wd = 1e-2,
+    betas = (0.9, 0.999),
+    filter_by_requires_grad = False
+):
+    if filter_by_requires_grad:
+        params = list(filter(lambda t: t.requires_grad, params))
+
+    if wd == 0:
+        return Adam(params, lr = lr, betas = betas)
+
+    params = set(params)
+    wd_params, no_wd_params = separate_weight_decayable_params(params)
+
+    param_groups = [
+        {'params': list(wd_params)},
+        {'params': list(no_wd_params), 'weight_decay': 0},
+    ]
+
+    return AdamW(param_groups, lr = lr, weight_decay = wd, betas = betas)
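To illustrate the grouping rule in `separate_weight_decayable_params`, here is a small sketch that swaps real parameters for a hypothetical `FakeParam` record carrying only a name and an `ndim`. The rule itself is copied from the diff: anything with fewer than two dimensions (biases, norm gains) is excluded from weight decay.

```python
from dataclasses import dataclass

@dataclass(frozen = True)   # frozen makes instances hashable, so they can be placed in a set
class FakeParam:
    """Hypothetical stand-in for a torch parameter: only a name and a number of dimensions."""
    name: str
    ndim: int

def separate_weight_decayable_params(params):
    no_wd_params = set([param for param in params if param.ndim < 2])
    wd_params = set(params) - no_wd_params
    return wd_params, no_wd_params

params = [
    FakeParam('conv.weight', 4),
    FakeParam('conv.bias', 1),
    FakeParam('norm.gamma', 1),
    FakeParam('to_out.weight', 2),
]

wd_params, no_wd_params = separate_weight_decayable_params(params)

# same shape as the param groups handed to AdamW, with names in place of tensors
param_groups = [
    {'params': sorted(p.name for p in wd_params)},
    {'params': sorted(p.name for p in no_wd_params), 'weight_decay': 0},
]
```

Note that `get_optimizer` converts the parameters to a set before grouping, so the order of parameters inside each group is not guaranteed. The sketch sorts by name only to make the result easy to read.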
--- a/dalle2_pytorch/train.py
+++ b/dalle2_pytorch/train.py
@@ -0,0 +1,198 @@
+import copy
+from functools import partial
+
+import torch
+from torch import nn
+from torch.cuda.amp import autocast, GradScaler
+
+from dalle2_pytorch.dalle2_pytorch import Decoder
+from dalle2_pytorch.optimizer import get_optimizer
+
+# helper functions
+
+def exists(val):
+    return val is not None
+
+def cast_tuple(val, length = 1):
+    return val if isinstance(val, tuple) else ((val,) * length)
+
+def pick_and_pop(keys, d):
+    values = list(map(lambda key: d.pop(key), keys))
+    return dict(zip(keys, values))
+
+def group_dict_by_key(cond, d):
+    return_val = [dict(),dict()]
+    for key in d.keys():
+        match = bool(cond(key))
+        ind = int(not match)
+        return_val[ind][key] = d[key]
+    return (*return_val,)
+
+def string_begins_with(prefix, str):
+    return str.startswith(prefix)
+
+def group_by_key_prefix(prefix, d):
+    return group_dict_by_key(partial(string_begins_with, prefix), d)
+
+def groupby_prefix_and_trim(prefix, d):
+    kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d)
+    kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
+    return kwargs_without_prefix, kwargs
+
+# exponential moving average wrapper
+
+class EMA(nn.Module):
+    def __init__(
+        self,
+        model,
+        beta = 0.99,
+        update_after_step = 1000,
+        update_every = 10,
+    ):
+        super().__init__()
+        self.beta = beta
+        self.online_model = model
+        self.ema_model = copy.deepcopy(model)
+
+        self.update_after_step = update_after_step # only start EMA after this step number, starting at 0
+        self.update_every = update_every
+
+        self.register_buffer('initted', torch.Tensor([False]))
+        self.register_buffer('step', torch.tensor([0.]))
+
+    def update(self):
+        self.step += 1
+
+        if self.step <= self.update_after_step or (self.step % self.update_every) != 0:
+            return
+
+        if not self.initted:
+            self.ema_model.load_state_dict(self.online_model.state_dict())
+            self.initted.data.copy_(torch.Tensor([True]))
+
+        self.update_moving_average(self.ema_model, self.online_model)
+
+    def update_moving_average(self, ma_model, current_model):
+        def calculate_ema(beta, old, new):
+            if not exists(old):
+                return new
+            return old * beta + (1 - beta) * new
+
+        for current_params, ma_params in zip(current_model.parameters(), ma_model.parameters()):
+            old_weight, up_weight = ma_params.data, current_params.data
+            ma_params.data = calculate_ema(self.beta, old_weight, up_weight)
+
+        for current_buffer, ma_buffer in zip(current_model.buffers(), ma_model.buffers()):
+            new_buffer_value = calculate_ema(self.beta, ma_buffer, current_buffer)
+            ma_buffer.copy_(new_buffer_value)
+
+    def __call__(self, *args, **kwargs):
+        return self.ema_model(*args, **kwargs)
+
+# trainers
+
+class DecoderTrainer(nn.Module):
+    def __init__(
+        self,
+        decoder,
+        use_ema = True,
+        lr = 3e-4,
+        wd = 1e-2,
+        max_grad_norm = None,
+        amp = False,
+        **kwargs
+    ):
+        super().__init__()
+        assert isinstance(decoder, Decoder)
+        ema_kwargs, kwargs = groupby_prefix_and_trim('ema_', kwargs)
+
+        self.decoder = decoder
+        self.num_unets = len(self.decoder.unets)
+
+        self.use_ema = use_ema
+
+        if use_ema:
+            has_lazy_linear = any([type(module) == nn.LazyLinear for module in decoder.modules()])
+            assert not has_lazy_linear, 'you must set the text_embed_dim on your u-nets if you plan on doing automatic exponential moving average'
+
+        self.ema_unets = nn.ModuleList([])
+
+        self.amp = amp
+
+        # be able to finely customize learning rate, weight decay
+        # per unet
+
+        lr, wd = map(partial(cast_tuple, length = self.num_unets), (lr, wd))
+
+        for ind, (unet, unet_lr, unet_wd) in enumerate(zip(self.decoder.unets, lr, wd)):
+            optimizer = get_optimizer(
+                unet.parameters(),
+                lr = unet_lr,
+                wd = unet_wd,
+                **kwargs
+            )
+
+            setattr(self, f'optim{ind}', optimizer) # cannot use pytorch ModuleList for some reason with optimizers
+
+            if self.use_ema:
+                self.ema_unets.append(EMA(unet, **ema_kwargs))
+
+            scaler = GradScaler(enabled = amp)
+            setattr(self, f'scaler{ind}', scaler)
+
+        # gradient clipping if needed
+
+        self.max_grad_norm = max_grad_norm
+
+    @property
+    def unets(self):
+        return nn.ModuleList([ema.ema_model for ema in self.ema_unets])
+
+    def scale(self, loss, *, unet_number):
+        assert 1 <= unet_number <= self.num_unets
+        index = unet_number - 1
+        scaler = getattr(self, f'scaler{index}')
+        return scaler.scale(loss)
+
+    def update(self, unet_number):
+        assert 1 <= unet_number <= self.num_unets
+        index = unet_number - 1
+        unet = self.decoder.unets[index]
+
+        optimizer = getattr(self, f'optim{index}')
+        scaler = getattr(self, f'scaler{index}')
+        if exists(self.max_grad_norm):
+            scaler.unscale_(optimizer)  # gradients must be unscaled before clipping under mixed precision
+            nn.utils.clip_grad_norm_(unet.parameters(), self.max_grad_norm)
+
+        scaler.step(optimizer)
+        scaler.update()
+        optimizer.zero_grad()
+
+        if self.use_ema:
+            ema_unet = self.ema_unets[index]
+            ema_unet.update()
+
+    @torch.no_grad()
+    def sample(self, *args, **kwargs):
+        if self.use_ema:
+            trainable_unets = self.decoder.unets
+            self.decoder.unets = self.unets                  # swap in exponential moving averaged unets for sampling
+
+        output = self.decoder.sample(*args, **kwargs)
+
+        if self.use_ema:
+            self.decoder.unets = trainable_unets             # restore original training unets
+        return output
+
+    def forward(
+        self,
+        x,
+        *,
+        unet_number,
+        divisor = 1,
+        **kwargs
+    ):
+        with autocast(enabled = self.amp):
+            loss = self.decoder(x, unet_number = unet_number, **kwargs)
+        return self.scale(loss / divisor, unet_number = unet_number)
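The `divisor` argument of `forward`, together with the separate `update` call, suggests the accumulation pattern below. This is a minimal sketch in plain PyTorch rather than with `DecoderTrainer` itself: the linear model, the random data, and `accum_steps` are assumptions made for illustration, and `GradScaler` is disabled so the snippet also runs on CPU.

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler

torch.manual_seed(0)

# stand-ins for a unet and its optimizer (assumptions, not the repository's classes)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr = 0.1)
scaler = GradScaler(enabled = False)  # disabled so the sketch runs without CUDA

data = torch.randn(8, 4)
target = torch.randn(8, 1)

# reference: gradient of the mean loss over the full batch
full_loss = nn.functional.mse_loss(model(data), target)
full_grad = torch.autograd.grad(full_loss, model.weight)[0]

# accumulate over equally sized micro-batches, dividing each loss by their count
accum_steps = 4
optimizer.zero_grad()

for x, y in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
    loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss / accum_steps).backward()  # same role as `divisor` in forward

accum_grad = model.weight.grad.clone()

# a single optimizer step after all micro-batches, as in `update`
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```

Because every micro-batch loss is divided by the number of accumulation steps before the backward pass, the summed gradient matches the gradient of the full-batch mean loss, so the learning rate does not need to change with the number of steps.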
--- a/dalle2_pytorch/vqgan_vae.py
+++ b/dalle2_pytorch/vqgan_vae.py
@@ -0,0 +1,757 @@
+import copy
+import math
+from math import sqrt
+from functools import partial, wraps
+
+from vector_quantize_pytorch import VectorQuantize as VQ
+
+import torch
+from torch import nn, einsum
+import torch.nn.functional as F
+from torch.autograd import grad as torch_grad
+import torchvision
+
+from einops import rearrange, reduce, repeat
+from einops_exts import rearrange_many
+from einops.layers.torch import Rearrange
+
+# constants
+
+MList = nn.ModuleList
+
+# helper functions
+
+def exists(val):
+    return val is not None
+
+def default(val, d):
+    return val if exists(val) else d
+
+# decorators
+
+def eval_decorator(fn):
+    def inner(model, *args, **kwargs):
+        was_training = model.training
+        model.eval()
+        out = fn(model, *args, **kwargs)
+        model.train(was_training)
+        return out
+    return inner
+
+def remove_vgg(fn):
+    @wraps(fn)
+    def inner(self, *args, **kwargs):
+        has_vgg = hasattr(self, 'vgg')
+        if has_vgg:
+            vgg = self.vgg
+            delattr(self, 'vgg')
+
+        out = fn(self, *args, **kwargs)
+
+        if has_vgg:
+            self.vgg = vgg
+
+        return out
+    return inner
+
+# keyword argument helpers
+
+def pick_and_pop(keys, d):
+    values = list(map(lambda key: d.pop(key), keys))
+    return dict(zip(keys, values))
+
+def group_dict_by_key(cond, d):
+    return_val = [dict(),dict()]
+    for key in d.keys():
+        match = bool(cond(key))
+        ind = int(not match)
+        return_val[ind][key] = d[key]
+    return (*return_val,)
+
+def string_begins_with(prefix, string):
+    return string.startswith(prefix)
+
+def group_by_key_prefix(prefix, d):
+    return group_dict_by_key(partial(string_begins_with, prefix), d)
+
+def groupby_prefix_and_trim(prefix, d):
+    kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d)
+    kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
+    return kwargs_without_prefix, kwargs
+
+# tensor helper functions
+
+def log(t, eps = 1e-10):
+    return torch.log(t + eps)
+
+def gradient_penalty(images, output, weight = 10):
+    batch_size = images.shape[0]
+    gradients = torch_grad(outputs = output, inputs = images,
+                           grad_outputs = torch.ones(output.size(), device = images.device),
+                           create_graph = True, retain_graph = True, only_inputs = True)[0]
+
+    gradients = rearrange(gradients, 'b ... -> b (...)')
+    return weight * ((gradients.norm(2, dim = 1) - 1) ** 2).mean()
+
+def l2norm(t):
+    return F.normalize(t, dim = -1)
+
+def leaky_relu(p = 0.1):
+    return nn.LeakyReLU(p)
+
+def stable_softmax(t, dim = -1, alpha = 32 ** 2):
+    t = t / alpha
+    t = t - torch.amax(t, dim = dim, keepdim = True).detach()
+    return (t * alpha).softmax(dim = dim)
+
+def safe_div(numer, denom, eps = 1e-8):
+    return numer / (denom + eps)
+
+# gan losses
+
+def hinge_discr_loss(fake, real):
+    return (F.relu(1 + fake) + F.relu(1 - real)).mean()
+
+def hinge_gen_loss(fake):
+    return -fake.mean()
+
+def bce_discr_loss(fake, real):
+    return (-log(1 - torch.sigmoid(fake)) - log(torch.sigmoid(real))).mean()
+
+def bce_gen_loss(fake):
+    return -log(torch.sigmoid(fake)).mean()
+
+def grad_layer_wrt_loss(loss, layer):
+    return torch_grad(
+        outputs = loss,
+        inputs = layer,
+        grad_outputs = torch.ones_like(loss),
+        retain_graph = True
+    )[0].detach()
+
+# vqgan vae
+
+class LayerNormChan(nn.Module):
+    def __init__(
+        self,
+        dim,
+        eps = 1e-5
+    ):
+        super().__init__()
+        self.eps = eps
+        self.gamma = nn.Parameter(torch.ones(1, dim, 1, 1))
+
+    def forward(self, x):
+        var = torch.var(x, dim = 1, unbiased = False, keepdim = True)
+        mean = torch.mean(x, dim = 1, keepdim = True)
+        return (x - mean) / (var + self.eps).sqrt() * self.gamma
+
+# discriminator
+
+class Discriminator(nn.Module):
+    def __init__(
+        self,
+        dims,
+        channels = 3,
+        groups = 16,
+        init_kernel_size = 5
+    ):
+        super().__init__()
+        dim_pairs = zip(dims[:-1], dims[1:])
+
+        self.layers = MList([nn.Sequential(nn.Conv2d(channels, dims[0], init_kernel_size, padding = init_kernel_size // 2), leaky_relu())])
+
+        for dim_in, dim_out in dim_pairs:
+            self.layers.append(nn.Sequential(
+                nn.Conv2d(dim_in, dim_out, 4, stride = 2, padding = 1),
+                nn.GroupNorm(groups, dim_out),
+                leaky_relu()
+            ))
+
+        dim = dims[-1]
+        self.to_logits = nn.Sequential( # return 5 x 5, for PatchGAN-esque training
+            nn.Conv2d(dim, dim, 1),
+            leaky_relu(),
+            nn.Conv2d(dim, 1, 4)
+        )
+
+    def forward(self, x):
+        for net in self.layers:
+            x = net(x)
+
+        return self.to_logits(x)
+
+# positional encoding
+
+class ContinuousPositionBias(nn.Module):
+    """ from https://arxiv.org/abs/2111.09883 """
+
+    def __init__(self, *, dim, heads, layers = 2):
+        super().__init__()
+        self.net = MList([])
+        self.net.append(nn.Sequential(nn.Linear(2, dim), leaky_relu()))
+
+        for _ in range(layers - 1):
+            self.net.append(nn.Sequential(nn.Linear(dim, dim), leaky_relu()))
+
+        self.net.append(nn.Linear(dim, heads))
+        self.register_buffer('rel_pos', None, persistent = False)
+
+    def forward(self, x):
+        n, device = x.shape[-1], x.device
+        fmap_size = int(sqrt(n))
+
+        if not exists(self.rel_pos):
+            pos = torch.arange(fmap_size, device = device)
+            grid = torch.stack(torch.meshgrid(pos, pos, indexing = 'ij'))
+            grid = rearrange(grid, 'c i j -> (i j) c')
+            rel_pos = rearrange(grid, 'i c -> i 1 c') - rearrange(grid, 'j c -> 1 j c')
+            rel_pos = torch.sign(rel_pos) * torch.log(rel_pos.abs() + 1)
+            self.register_buffer('rel_pos', rel_pos, persistent = False)
+
+        rel_pos = self.rel_pos.float()
+
+        for layer in self.net:
+            rel_pos = layer(rel_pos)
+
+        bias = rearrange(rel_pos, 'i j h -> h i j')
+        return x + bias
+
+# resnet encoder / decoder
+
+class ResnetEncDec(nn.Module):
+    def __init__(
+        self,
+        dim,
+        *,
+        channels = 3,
+        layers = 4,
+        layer_mults = None,
+        num_resnet_blocks = 1,
+        resnet_groups = 16,
+        first_conv_kernel_size = 5,
+        use_attn = True,
+        attn_dim_head = 64,
+        attn_heads = 8,
+        attn_dropout = 0.,
+    ):
+        super().__init__()
+        assert dim % resnet_groups == 0, f'dimension {dim} must be divisible by {resnet_groups} (groups for the groupnorm)'
+
+        self.layers = layers
+
+        self.encoders = MList([])
+        self.decoders = MList([])
+
+        layer_mults = default(layer_mults, list(map(lambda t: 2 ** t, range(layers))))
+        assert len(layer_mults) == layers, 'layer multipliers must be equal to designated number of layers'
+
+        layer_dims = [dim * mult for mult in layer_mults]
+        dims = (dim, *layer_dims)
+
+        self.encoded_dim = dims[-1]
+
+        dim_pairs = zip(dims[:-1], dims[1:])
+
+        append = lambda arr, t: arr.append(t)
+        prepend = lambda arr, t: arr.insert(0, t)
+
+        if not isinstance(num_resnet_blocks, tuple):
+            num_resnet_blocks = (*((0,) * (layers - 1)), num_resnet_blocks)
+
+        if not isinstance(use_attn, tuple):
+            use_attn = (*((False,) * (layers - 1)), use_attn)
+
+        assert len(num_resnet_blocks) == layers, 'number of resnet blocks config must be equal to number of layers'
+        assert len(use_attn) == layers
+
+        for layer_index, (dim_in, dim_out), layer_num_resnet_blocks, layer_use_attn in zip(range(layers), dim_pairs, num_resnet_blocks, use_attn):
+            append(self.encoders, nn.Sequential(nn.Conv2d(dim_in, dim_out, 4, stride = 2, padding = 1), leaky_relu()))
+            prepend(self.decoders, nn.Sequential(nn.ConvTranspose2d(dim_out, dim_in, 4, 2, 1), leaky_relu()))
+
+            if layer_use_attn:
+                prepend(self.decoders, VQGanAttention(dim = dim_out, heads = attn_heads, dim_head = attn_dim_head, dropout = attn_dropout))
+
+            for _ in range(layer_num_resnet_blocks):
+                append(self.encoders, ResBlock(dim_out, groups = resnet_groups))
+                prepend(self.decoders, GLUResBlock(dim_out, groups = resnet_groups))
+
+            if layer_use_attn:
+                append(self.encoders, VQGanAttention(dim = dim_out, heads = attn_heads, dim_head = attn_dim_head, dropout = attn_dropout))
+
+        prepend(self.encoders, nn.Conv2d(channels, dim, first_conv_kernel_size, padding = first_conv_kernel_size // 2))
+        append(self.decoders, nn.Conv2d(dim, channels, 1))
+
+    def get_encoded_fmap_size(self, image_size):
+        return image_size // (2 ** self.layers)
+
+    def encode(self, x):
+        for enc in self.encoders:
+            x = enc(x)
+        return x
+
+    def decode(self, x):
+        for dec in self.decoders:
+            x = dec(x)
+        return x
+
+class GLUResBlock(nn.Module):
+    def __init__(self, chan, groups = 16):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Conv2d(chan, chan * 2, 3, padding = 1),
+            nn.GLU(dim = 1),
+            nn.GroupNorm(groups, chan),
+            nn.Conv2d(chan, chan * 2, 3, padding = 1),
+            nn.GLU(dim = 1),
+            nn.GroupNorm(groups, chan),
+            nn.Conv2d(chan, chan, 1)
+        )
+
+    def forward(self, x):
+        return self.net(x) + x
+
+class ResBlock(nn.Module):
+    def __init__(self, chan, groups = 16):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Conv2d(chan, chan, 3, padding = 1),
+            nn.GroupNorm(groups, chan),
+            leaky_relu(),
+            nn.Conv2d(chan, chan, 3, padding = 1),
+            nn.GroupNorm(groups, chan),
+            leaky_relu(),
+            nn.Conv2d(chan, chan, 1)
+        )
+
+    def forward(self, x):
+        return self.net(x) + x
+
+# vqgan attention layer
+
+class VQGanAttention(nn.Module):
+    def __init__(
+        self,
+        *,
+        dim,
+        dim_head = 64,
+        heads = 8,
+        dropout = 0.
+    ):
+        super().__init__()
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+        inner_dim = heads * dim_head
+
+        self.dropout = nn.Dropout(dropout)
+        self.pre_norm = LayerNormChan(dim)
+
+        self.cpb = ContinuousPositionBias(dim = dim // 4, heads = heads)
+        self.to_qkv = nn.Conv2d(dim, inner_dim * 3, 1, bias = False)
+        self.to_out = nn.Conv2d(inner_dim, dim, 1, bias = False)
+
+    def forward(self, x):
+        h = self.heads
+        height, width, residual = *x.shape[-2:], x.clone()
+
+        x = self.pre_norm(x)
+
+        q, k, v = self.to_qkv(x).chunk(3, dim = 1)
+
+        q, k, v = map(lambda t: rearrange(t, 'b (h c) x y -> b h c (x y)', h = h), (q, k, v))
+
+        sim = einsum('b h c i, b h c j -> b h i j', q, k) * self.scale
+
+        sim = self.cpb(sim)
+
+        attn = stable_softmax(sim, dim = -1)
+        attn = self.dropout(attn)
+
+        out = einsum('b h i j, b h c j -> b h c i', attn, v)
+        out = rearrange(out, 'b h c (x y) -> b (h c) x y', x = height, y = width)
+        out = self.to_out(out)
+
+        return out + residual
+
+# ViT encoder / decoder
+
+class RearrangeImage(nn.Module):
+    def forward(self, x):
+        n = x.shape[1]
+        w = h = int(sqrt(n))
+        return rearrange(x, 'b (h w) ... -> b h w ...', h = h, w = w)
+
+class Attention(nn.Module):
+    def __init__(
+        self,
+        dim,
+        *,
+        heads = 8,
+        dim_head = 32
+    ):
+        super().__init__()
+        self.norm = nn.LayerNorm(dim)
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+        inner_dim = dim_head * heads
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+        self.to_out = nn.Linear(inner_dim, dim)
+
+    def forward(self, x):
+        h = self.heads
+
+        x = self.norm(x)
+
+        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = rearrange_many((q, k, v), 'b n (h d) -> b h n d', h = h)
+
+        q = q * self.scale
+        sim = einsum('b h i d, b h j d -> b h i j', q, k)
+
+        sim = sim - sim.amax(dim = -1, keepdim = True).detach()
+        attn = sim.softmax(dim = -1)
+
+        out = einsum('b h i j, b h j d -> b h i d', attn, v)
+
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+def FeedForward(dim, mult = 4):
+    return nn.Sequential(
+        nn.LayerNorm(dim),
+        nn.Linear(dim, dim * mult, bias = False),
+        nn.GELU(),
+        nn.Linear(dim * mult, dim, bias = False)
+    )
+
+class Transformer(nn.Module):
+    def __init__(
+        self,
+        dim,
+        *,
+        layers,
+        dim_head = 32,
+        heads = 8,
+        ff_mult = 4
+    ):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(layers):
+            self.layers.append(nn.ModuleList([
+                Attention(dim = dim, dim_head = dim_head, heads = heads),
+                FeedForward(dim = dim, mult = ff_mult)
+            ]))
+
+        self.norm = nn.LayerNorm(dim)
+
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+
+        return self.norm(x)
+
+class ViTEncDec(nn.Module):
+    def __init__(
+        self,
+        dim,
+        channels = 3,
+        layers = 4,
+        patch_size = 8,
+        dim_head = 32,
+        heads = 8,
+        ff_mult = 4
+    ):
+        super().__init__()
+        self.encoded_dim = dim
+        self.patch_size = patch_size
+
+        input_dim = channels * (patch_size ** 2)
+
+        self.encoder = nn.Sequential(
+            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
+            nn.Linear(input_dim, dim),
+            Transformer(
+                dim = dim,
+                dim_head = dim_head,
+                heads = heads,
+                ff_mult = ff_mult,
+                layers = layers
+            ),
+            RearrangeImage(),
+            Rearrange('b h w c -> b c h w')
+        )
+
+        self.decoder = nn.Sequential(
+            Rearrange('b c h w -> b (h w) c'),
+            Transformer(
+                dim = dim,
+                dim_head = dim_head,
+                heads = heads,
+                ff_mult = ff_mult,
+                layers = layers
+            ),
+            nn.Sequential(
+                nn.Linear(dim, dim * 4, bias = False),
+                nn.Tanh(),
+                nn.Linear(dim * 4, input_dim, bias = False),
+            ),
+            RearrangeImage(),
+            Rearrange('b h w (p1 p2 c) -> b c (h p1) (w p2)', p1 = patch_size, p2 = patch_size)
+        )
+
+    def get_encoded_fmap_size(self, image_size):
+        return image_size // self.patch_size
+
+    def encode(self, x):
+        return self.encoder(x)
+
+    def decode(self, x):
+        return self.decoder(x)
+
+# main vqgan-vae classes
+
+class NullVQGanVAE(nn.Module):
+    def __init__(
+        self,
+        *,
+        channels
+    ):
+        super().__init__()
+        self.encoded_dim = channels
+        self.layers = 0
+
+    def get_encoded_fmap_size(self, size):
+        return size
+
+    def copy_for_eval(self):
+        return self
+
+    def encode(self, x):
+        return x
+
+    def decode(self, x):
+        return x
+
+class VQGanVAE(nn.Module):
+    def __init__(
+        self,
+        *,
+        dim,
+        image_size,
+        channels = 3,
+        layers = 4,
+        l2_recon_loss = False,
+        use_hinge_loss = True,
+        vgg = None,
+        vq_codebook_dim = 256,
+        vq_codebook_size = 512,
+        vq_decay = 0.8,
+        vq_commitment_weight = 1.,
+        vq_kmeans_init = True,
+        vq_use_cosine_sim = True,
+        use_vgg_and_gan = True,
+        vae_type = 'resnet',
+        discr_layers = 4,
+        **kwargs
+    ):
+        super().__init__()
+        vq_kwargs, kwargs = groupby_prefix_and_trim('vq_', kwargs)
+        encdec_kwargs, kwargs = groupby_prefix_and_trim('encdec_', kwargs)
+
+        self.image_size = image_size
+        self.channels = channels
+        self.codebook_size = vq_codebook_size
+
+        if vae_type == 'resnet':
+            enc_dec_klass = ResnetEncDec
+        elif vae_type == 'vit':
+            enc_dec_klass = ViTEncDec
+        else:
+            raise ValueError(f'{vae_type} not valid')
+
+        self.enc_dec = enc_dec_klass(
+            dim = dim,
+            channels = channels,
+            layers = layers,
+            **encdec_kwargs
+        )
+
+        self.vq = VQ(
+            dim = self.enc_dec.encoded_dim,
+            codebook_dim = vq_codebook_dim,
+            codebook_size = vq_codebook_size,
+            decay = vq_decay,
+            commitment_weight = vq_commitment_weight,
+            accept_image_fmap = True,
+            kmeans_init = vq_kmeans_init,
+            use_cosine_sim = vq_use_cosine_sim,
+            **vq_kwargs
+        )
+
+        # reconstruction loss
+
+        self.recon_loss_fn = F.mse_loss if l2_recon_loss else F.l1_loss
+
+        # turn off GAN and perceptual loss if grayscale
+
+        self.vgg = None
+        self.discr = None
+        self.use_vgg_and_gan = use_vgg_and_gan
+
+        if not use_vgg_and_gan:
+            return
+
+        # perceptual loss
+
+        if exists(vgg):
+            self.vgg = vgg
+        else:
+            self.vgg = torchvision.models.vgg16(pretrained = True)
+            self.vgg.classifier = nn.Sequential(*self.vgg.classifier[:-2])
+
+        # gan related losses
+
+        layer_mults = list(map(lambda t: 2 ** t, range(discr_layers)))
+        layer_dims = [dim * mult for mult in layer_mults]
+        dims = (dim, *layer_dims)
+
+        self.discr = Discriminator(dims = dims, channels = channels)
+
+        self.discr_loss = hinge_discr_loss if use_hinge_loss else bce_discr_loss
+        self.gen_loss = hinge_gen_loss if use_hinge_loss else bce_gen_loss
+
+    @property
+    def encoded_dim(self):
+        return self.enc_dec.encoded_dim
+
+    def get_encoded_fmap_size(self, image_size):
+        return self.enc_dec.get_encoded_fmap_size(image_size)
+
+    def copy_for_eval(self):
+        device = next(self.parameters()).device
+        vae_copy = copy.deepcopy(self.cpu())
+
+        if vae_copy.use_vgg_and_gan:
+            del vae_copy.discr
+            del vae_copy.vgg
+
+        vae_copy.eval()
+        return vae_copy.to(device)
+
+    @remove_vgg
+    def state_dict(self, *args, **kwargs):
+        return super().state_dict(*args, **kwargs)
+
+    @remove_vgg
+    def load_state_dict(self, *args, **kwargs):
+        return super().load_state_dict(*args, **kwargs)
+
+    @property
+    def codebook(self):
+        return self.vq.codebook
+
+    def encode(self, fmap):
+        fmap = self.enc_dec.encode(fmap)
+        return fmap
+
+    def decode(self, fmap, return_indices_and_loss = False):
+        fmap, indices, commit_loss = self.vq(fmap)
+
+        fmap = self.enc_dec.decode(fmap)
+
+        if not return_indices_and_loss:
+            return fmap
+
+        return fmap, indices, commit_loss
+
+    def forward(
+        self,
+        img,
+        return_loss = False,
+        return_discr_loss = False,
+        return_recons = False,
+        add_gradient_penalty = True
+    ):
+        batch, channels, height, width, device = *img.shape, img.device
+        assert height == self.image_size and width == self.image_size, f'height and width of input image must be equal to {self.image_size}'
+        assert channels == self.channels, 'number of channels on image or sketch is not equal to the channels set on this VQGanVAE'
+
+        fmap = self.encode(img)
+
+        fmap, indices, commit_loss = self.decode(fmap, return_indices_and_loss = True)
+
+        if not return_loss and not return_discr_loss:
+            return fmap
+
+        assert return_loss ^ return_discr_loss, 'you should either return autoencoder loss or discriminator loss, but not both'
+
+        # whether to return discriminator loss
+
+        if return_discr_loss:
+            assert exists(self.discr), 'discriminator must exist to train it'
+
+            fmap.detach_()
+            img.requires_grad_()
+
+            fmap_discr_logits, img_discr_logits = map(self.discr, (fmap, img))
+
+            discr_loss = self.discr_loss(fmap_discr_logits, img_discr_logits)
+
+            loss = discr_loss
+
+            if add_gradient_penalty:
+                gp = gradient_penalty(img, img_discr_logits)
+                loss = loss + gp
+
+            if return_recons:
+                return loss, fmap
+
+            return loss
+
+        # reconstruction loss
+
+        recon_loss = self.recon_loss_fn(fmap, img)
+
+        # early return if training on grayscale
+
+        if not self.use_vgg_and_gan:
+            if return_recons:
+                return recon_loss, fmap
+
+            return recon_loss
+
+        # perceptual loss
+
+        img_vgg_input = img
+        fmap_vgg_input = fmap
+
+        if img.shape[1] == 1:
+            # handle grayscale for vgg
+            img_vgg_input, fmap_vgg_input = map(lambda t: repeat(t, 'b 1 ... -> b c ...', c = 3), (img_vgg_input, fmap_vgg_input))
+
+        img_vgg_feats = self.vgg(img_vgg_input)
+        recon_vgg_feats = self.vgg(fmap_vgg_input)
+        perceptual_loss = F.mse_loss(img_vgg_feats, recon_vgg_feats)
+
+        # generator loss
+
+        gen_loss = self.gen_loss(self.discr(fmap))
+
+        # calculate adaptive weight
+
+        last_dec_layer = self.enc_dec.decoders[-1].weight # final conv of the resnet decoder, which lives on enc_dec
+
+        norm_grad_wrt_gen_loss = grad_layer_wrt_loss(gen_loss, last_dec_layer).norm(p = 2)
+        norm_grad_wrt_perceptual_loss = grad_layer_wrt_loss(perceptual_loss, last_dec_layer).norm(p = 2)
+
+        adaptive_weight = safe_div(norm_grad_wrt_perceptual_loss, norm_grad_wrt_gen_loss)
+        adaptive_weight.clamp_(max = 1e4)
+
+        # combine losses
+
+        loss = recon_loss + perceptual_loss + commit_loss + adaptive_weight * gen_loss
+
+        if return_recons:
+            return loss, fmap
+
+        return loss
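`VQGanVAE.__init__` routes any extra keyword arguments by prefix: keys starting with `vq_` go to the vector quantizer and keys starting with `encdec_` go to the encoder / decoder, with the prefix stripped in both cases. The sketch below is a compact, standalone rewrite of the two helpers for illustration, not the repository's exact code; `vq_some_option` is a hypothetical key, while `attn_heads` and `resnet_groups` are `ResnetEncDec` parameters from the diff above.

```python
def group_dict_by_key(cond, d):
    # split a dict into (entries whose key satisfies cond, all other entries)
    matched, rest = dict(), dict()
    for key, value in d.items():
        (matched if cond(key) else rest)[key] = value
    return matched, rest

def groupby_prefix_and_trim(prefix, d):
    # pull out the keys starting with prefix and strip the prefix from them
    with_prefix, rest = group_dict_by_key(lambda key: key.startswith(prefix), d)
    trimmed = {key[len(prefix):]: value for key, value in with_prefix.items()}
    return trimmed, rest

# illustrative extra keyword arguments (vq_some_option is hypothetical)
kwargs = dict(vq_some_option = 1, encdec_attn_heads = 4, encdec_resnet_groups = 8)

vq_kwargs, kwargs = groupby_prefix_and_trim('vq_', kwargs)
encdec_kwargs, rest = groupby_prefix_and_trim('encdec_', kwargs)

print(vq_kwargs)      # {'some_option': 1}
print(encdec_kwargs)  # {'attn_heads': 4, 'resnet_groups': 8}
print(rest)           # {}
```

Anything left in `rest` matched neither prefix; in the constructor above such leftovers are silently dropped instead of raising an error.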
--- a/setup.py
+++ b/setup.py
@@ -10,7 +10,7 @@ setup(
      'dream = dalle2_pytorch.cli:dream'
    ],
  },
-  version = '0.0.20',
+  version = '0.0.81',
  license='MIT',
  description = 'DALL-E 2',
  author = 'Phil Wang',
@@ -23,6 +23,7 @@ setup(
  ],
  install_requires=[
    'click',
+    'clip-anytorch',
    'einops>=0.4',
    'einops-exts>=0.0.3',
    'kornia>=0.5.4',
@@ -30,7 +31,8 @@ setup(
    'torch>=1.10',
    'torchvision',
    'tqdm',
-    'x-clip>=0.4.4',
+    'vector-quantize-pytorch',
+    'x-clip>=0.5.1',
    'youtokentome'
  ],
  classifiers=[
Author	SHA1	Message	Date
Phil Wang	ebe01749ed	DecoderTrainer sample method uses the exponentially moving averaged	2022-04-30 14:55:34 -07:00
Phil Wang	63195cc2cb	allow for division of loss prior to scaling, for gradient accumulation purposes	2022-04-30 12:56:47 -07:00
Phil Wang	a2ef69af66	take care of mixed precision, and make gradient accumulation do-able externally	2022-04-30 12:27:24 -07:00
Phil Wang	5fff22834e	be able to finely customize learning parameters for each unet, take care of gradient clipping	2022-04-30 11:56:05 -07:00
Phil Wang	a9421f49ec	simplify Decoder training for the public	2022-04-30 11:45:18 -07:00
Phil Wang	77fa34eae9	fix all clipping / clamping issues	2022-04-30 10:08:24 -07:00
Phil Wang	1c1e508369	fix all issues with text encodings conditioning in the decoder, using null padding tokens technique from dalle v1	2022-04-30 09:13:34 -07:00
Phil Wang	f19c99ecb0	fix decoder needing separate conditional dropping probabilities for image embeddings and text encodings, thanks to @xiankgx !	2022-04-30 08:48:05 -07:00
Phil Wang	721a444686	Merge pull request #37 from ProGamerGov/patch-1 Fix spelling and grammatical errors	2022-04-30 08:19:07 -07:00
ProGamerGov	63450b466d	Fix spelling and grammatical errors	2022-04-30 09:18:13 -06:00
Phil Wang	20e7eb5a9b	cleanup	2022-04-30 07:22:57 -07:00
Phil Wang	e2f9615afa	use @clip-anytorch , thanks to @rom1504	2022-04-30 06:40:54 -07:00
Phil Wang	0d1c07c803	fix a bug with classifier free guidance, thanks to @xiankgx again!	2022-04-30 06:34:57 -07:00
Phil Wang	a389f81138	todo	2022-04-29 15:40:51 -07:00
Phil Wang	0283556608	fix example in readme, since api changed	2022-04-29 13:40:55 -07:00
Phil Wang	5063d192b6	now completely OpenAI CLIP compatible for training just take care of the logic for AdamW and transformers used namedtuples for clip adapter embedding outputs	2022-04-29 13:05:01 -07:00
Phil Wang	f4a54e475e	add some training fns	2022-04-29 09:44:55 -07:00
Phil Wang	fb662a62f3	fix another bug thanks to @xiankgx	2022-04-29 07:38:32 -07:00
Phil Wang	587c8c9b44	optimize for clarity	2022-04-28 21:59:13 -07:00
Phil Wang	aa900213e7	force first unet in the cascade to be conditioned on image embeds	2022-04-28 20:53:15 -07:00
Phil Wang	cb26187450	vqgan-vae codebook dims should be 256 or smaller	2022-04-28 08:59:03 -07:00
Phil Wang	625ce23f6b	🐛	2022-04-28 07:21:18 -07:00
Phil Wang	dbf4a281f1	make sure another CLIP can actually be passed in, as long as it is wrapped in an adapter extended from BaseClipAdapter	2022-04-27 20:45:27 -07:00
Phil Wang	4ab527e779	some extra asserts for text encoding of diffusion prior and decoder	2022-04-27 20:11:43 -07:00
Phil Wang	d0cdeb3247	add ability for DALL-E2 to return PIL images with `return_pil_images = True` on forward, for those who have no clue about deep learning	2022-04-27 19:58:06 -07:00
Phil Wang	8c610aad9a	only pass text encodings conditioning in diffusion prior if specified on initialization	2022-04-27 19:48:16 -07:00
Phil Wang	6700381a37	prepare for ability to integrate other clips other than x-clip	2022-04-27 19:35:05 -07:00
Phil Wang	20377f889a	todo	2022-04-27 17:22:14 -07:00
Phil Wang	6edb1c5dd0	fix issue with ema class	2022-04-27 16:40:02 -07:00
Phil Wang	b093f92182	inform what is possible	2022-04-27 08:25:16 -07:00
Phil Wang	fa3bb6ba5c	make sure cpu-only still works	2022-04-27 08:02:10 -07:00
Phil Wang	2705e7c9b0	attention-based upsampling claims unsupported by local experiments, removing	2022-04-27 07:51:04 -07:00
Phil Wang	77141882c8	complete vit-vqgan from https://arxiv.org/abs/2110.04627	2022-04-26 17:20:47 -07:00
Phil Wang	4075d02139	nevermind, it could be working, but only when i stabilize it with the feedforward layer + tanh as proposed in vit-vqgan paper (which will be built into the repository later for the latent diffusion)	2022-04-26 12:43:31 -07:00
Phil Wang	de0296106b	be able to turn off warning for use of LazyLinear by passing in text embedding dimension for unet	2022-04-26 11:42:46 -07:00
Phil Wang	eafb136214	suppress a warning	2022-04-26 11:40:45 -07:00
Phil Wang	bfbcc283a3	DRY a tiny bit for gaussian diffusion related logic	2022-04-26 11:39:12 -07:00
Phil Wang	c30544b73a	no CLIP altogether for training DiffusionPrior	2022-04-26 10:23:41 -07:00
Phil Wang	bdf5e9c009	todo	2022-04-26 09:56:54 -07:00
Phil Wang	9878be760b	have researcher explicitly state upfront whether to condition with text encodings in cascading ddpm decoder, have DALLE-2 class take care of passing in text if feature turned on	2022-04-26 09:47:09 -07:00
Phil Wang	7ba6357c05	allow for training the Prior network with precomputed CLIP embeddings (or text encodings)	2022-04-26 09:29:51 -07:00
Phil Wang	76e063e8b7	refactor so that the causal transformer in the diffusion prior network can be conditioned without text encodings (for Laions parallel efforts, although it seems from the paper it is needed)	2022-04-26 09:00:11 -07:00
Phil Wang	4d25976f33	make sure non-latent diffusion still works	2022-04-26 08:36:00 -07:00
Phil Wang	0b28ee0d01	revert back to old upsampling, paper does not work	2022-04-26 07:39:04 -07:00
Phil Wang	45262a4bb7	bring in the exponential moving average wrapper, to get ready for training	2022-04-25 19:24:13 -07:00
Phil Wang	13a58a78c4	scratch off todo	2022-04-25 19:01:30 -07:00
Phil Wang	f75d49c781	start a file for all attention-related modules, use attention-based upsampling in the unets in dalle-2	2022-04-25 18:59:10 -07:00
Phil Wang	3b520dfa85	bring in attention-based upsampling to strengthen vqgan-vae, seems to work as advertised in initial experiments in GAN	2022-04-25 17:27:45 -07:00
Phil Wang	79198c6ae4	keep readme simple for reader	2022-04-25 17:21:45 -07:00
Phil Wang	77a246b1b9	todo	2022-04-25 08:48:28 -07:00
Phil Wang	f93a3f6ed8	reprioritize	2022-04-25 08:44:27 -07:00
Phil Wang	8f2a0c7e00	better naming	2022-04-25 07:44:33 -07:00
Phil Wang	863f4ef243	just take care of the logic for setting all latent diffusion to predict x0, if needed	2022-04-24 10:06:42 -07:00
Phil Wang	fb8a66a2de	just in case latent diffusion performs better with prediction of x0 instead of epsilon, open up the research avenue	2022-04-24 10:04:22 -07:00
Phil Wang	579d4b42dd	does not seem right to clip for the prior diffusion part	2022-04-24 09:51:18 -07:00
Phil Wang	473808850a	some outlines to the eventual CLI endpoint	2022-04-24 09:27:15 -07:00
Phil Wang	d5318aef4f	todo	2022-04-23 08:23:08 -07:00
Phil Wang	f82917e1fd	prepare for turning off gradient penalty, as shown in GAN literature, GP needs to be only applied 1 out of 4 iterations	2022-04-23 07:52:10 -07:00
Phil Wang	05b74be69a	use null container pattern to cleanup some conditionals, save more cleanup for next week	2022-04-22 15:23:18 -07:00
Phil Wang	a8b5d5d753	last tweak of readme	2022-04-22 14:16:43 -07:00
Phil Wang	976ef7f87c	project management	2022-04-22 14:15:42 -07:00
Phil Wang	fd175bcc0e	readme	2022-04-22 14:13:33 -07:00
Phil Wang	76b32f18b3	first pass at complete DALL-E2 + Latent Diffusion integration, latent diffusion on any layer(s) of the cascading ddpm in the decoder.	2022-04-22 13:53:13 -07:00
Phil Wang	f2d5b87677	todo	2022-04-22 11:39:58 -07:00
Phil Wang	461347c171	fix vqgan-vae for latent diffusion	2022-04-22 11:38:57 -07:00
Phil Wang	46cef31c86	optional projection out for prior network causal transformer	2022-04-22 11:16:30 -07:00
Phil Wang	59b1a77d4d	be a bit more conservative and stick with layernorm (without bias) for now, given @borisdayma results https://twitter.com/borisdayma/status/1517227191477571585	2022-04-22 11:14:54 -07:00
Phil Wang	7f338319fd	makes more sense for blur augmentation to happen before the upsampling	2022-04-22 11:10:47 -07:00
Phil Wang	2c6c91829d	refactor blurring training augmentation to be taken care of by the decoder, with option to downsample to previous resolution before upsampling (cascading ddpm). this opens up the possibility of cascading latent ddpm	2022-04-22 11:09:17 -07:00
Phil Wang	ad17c69ab6	prepare for latent diffusion in the first DDPM of the cascade in the Decoder	2022-04-21 17:54:31 -07:00
Phil Wang	0b4ec34efb	todo	2022-04-20 12:24:23 -07:00
Phil Wang	f027b82e38	remove wip as main networks (prior and decoder) are completed	2022-04-20 12:12:16 -07:00
Phil Wang	8cc9016cb0	Merge pull request #17 from kashif/patch-2 added diffusion-gan thoughts	2022-04-20 12:10:26 -07:00
Kashif Rasul	1d8f37befe	added diffusion-gan thoughts https://github.com/NVlabs/denoising-diffusion-gan	2022-04-20 21:01:11 +02:00
Phil Wang	faebf4c8b8	from my vision transformer experience, dimension of attention head of 32 is sufficient for image feature maps	2022-04-20 11:40:32 -07:00
Phil Wang	b8e8d3c164	thoughts	2022-04-20 11:34:51 -07:00
Phil Wang	8e2416b49b	commit to generalizing latent diffusion to one model	2022-04-20 11:27:42 -07:00
Phil Wang	f37c26e856	cleanup and DRY a little	2022-04-20 10:56:32 -07:00
Phil Wang	27a33e1b20	complete contextmanager method for keeping only one unet in GPU during training or inference	2022-04-20 10:46:13 -07:00
Phil Wang	6f941a219a	give time tokens a surface area of 2 tokens as default, make it so researcher can customize which unet actually is conditioned on image embeddings and/or text encodings	2022-04-20 10:04:47 -07:00
Phil Wang	ddde8ca1bf	fix cosine beta schedule, thanks to @Zhengxinyang	2022-04-19 20:54:28 -07:00
Phil Wang	c26b77ad20	todo	2022-04-19 13:07:32 -07:00
Phil Wang	c5b4aab8e5	intent	2022-04-19 11:00:05 -07:00
Phil Wang	a35c309b5f	add sparse attention layers in between convnext blocks in unet (grid like attention, used in mobilevit, maxvit [bytedance ai], as well as a growing number of attention-based GANs)	2022-04-19 09:49:03 -07:00
Phil Wang	55bdcb98b9	scaffold for latent diffusion	2022-04-19 09:26:58 -07:00
Phil Wang	82328f16cd	same for text encodings for decoder ddpm training	2022-04-18 14:41:02 -07:00
Phil Wang	6fee4fce6e	also allow for image embedding to be passed into the diffusion model, in the case one wants to generate image embedding once and then train multiple unets in one iteration	2022-04-18 14:00:38 -07:00
Phil Wang	a54e309269	prioritize todos, play project management	2022-04-18 13:28:01 -07:00
Phil Wang	c6bfd7fdc8	readme	2022-04-18 12:43:10 -07:00
Phil Wang	960a79857b	use some magic just this once to remove the need for researchers to think	2022-04-18 12:40:43 -07:00
Phil Wang	7214df472d	todo	2022-04-18 12:18:19 -07:00
Phil Wang	00ae50999b	make kernel size and sigma for gaussian blur for cascading DDPM overridable at forward. also make sure unets are wrapped in a modulelist so that at sample time, blurring does not happen	2022-04-18 12:04:31 -07:00