fix random crop probability

allow for last unet in the cascade to be trained on crops, if it is convolution-only
product management
2026-02-12 11:34:29 +01:00 · 2022-05-04 11:52:24 -07:00 · 2022-05-04 11:48:48 -07:00 · 2022-05-04 11:20:50 -07:00 · 2022-05-04 11:18:54 -07:00 · 2022-05-04 11:18:32 -07:00
10 changed files with 3744 additions and 504 deletions
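The commit message mentions training the last unet in the cascade on random crops when it is convolution-only. The sketch below shows one way such paired cropping could work. It is not taken from this diff: the function name `paired_random_crop` and the parameter `p` are hypothetical, and it assumes the low resolution conditioning image has already been resized to the target resolution, so that the same window can be cut from both tensors.

```python
import torch

def paired_random_crop(images, lowres_cond, crop_size, p = 1.0):
    # hypothetical helper, not part of dalle2_pytorch
    # both tensors are assumed to be (batch, channels, height, width) with equal spatial size
    assert images.shape[-2:] == lowres_cond.shape[-2:]

    # crop with probability p, otherwise return the inputs untouched
    if torch.rand(1).item() >= p:
        return images, lowres_cond

    height, width = images.shape[-2:]
    top = torch.randint(0, height - crop_size + 1, (1,)).item()
    left = torch.randint(0, width - crop_size + 1, (1,)).item()

    # the same window is applied to the target and to its conditioning image
    window = (..., slice(top, top + crop_size), slice(left, left + crop_size))
    return images[window], lowres_cond[window]

images = torch.randn(2, 3, 64, 64)
lowres_cond = torch.randn(2, 3, 64, 64)

cropped, cropped_cond = paired_random_crop(images, lowres_cond, crop_size = 32)
uncropped, _ = paired_random_crop(images, lowres_cond, crop_size = 32, p = 0.0)
```

Restricting this to a convolution-only unet is plausible because convolutions do not depend on the absolute image size, so a network trained on crops can still be sampled at full resolution.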
--- a/README.md
+++ b/README.md
@@ -1,20 +1,18 @@
 <img src="./dalle2.png" width="450px"></img>

-## DALL-E 2 - Pytorch (wip)
+## DALL-E 2 - Pytorch

-Implementation of <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, OpenAI's updated text-to-image synthesis neural network, in Pytorch. <a href="https://youtu.be/RJwPN4qNi_Y?t=555">Yannic Kilcher summary</a>
+Implementation of <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, OpenAI's updated text-to-image synthesis neural network, in Pytorch.
+
+<a href="https://youtu.be/RJwPN4qNi_Y?t=555">Yannic Kilcher summary</a> | <a href="https://www.youtube.com/watch?v=F1X4fHzF4mQ">AssemblyAI explainer</a>

 The main novelty seems to be an extra layer of indirection with the prior network (whether it is an autoregressive transformer or a diffusion network), which predicts an image embedding based on the text embedding from CLIP. Specifically, this repository will only build out the diffusion prior network, as it is the best performing variant (but which incidentally involves a causal transformer as the denoising network 😂)

 This model is SOTA for text-to-image for now.

-It may also explore an extension of using <a href="https://huggingface.co/spaces/multimodalart/latentdiffusion">latent diffusion</a> in the decoder from Rombach et al.
-
 Please join <a href="https://discord.gg/xBPBXfcFHd"><img alt="Join us on Discord" src="https://img.shields.io/discord/823813159592001537?color=5865F2&logo=discord&logoColor=white"></a> if you are interested in helping out with the replication

-Do let me know if anyone is interested in a Jax version https://github.com/lucidrains/DALLE2-pytorch/discussions/8
-
-For all of you emailing me (there is a lot), the best way to contribute is through pull requests. Everything is open sourced after all. All my thoughts are public. This is your moment to participate.
+There was enough interest for a <a href="https://github.com/lucidrains/dalle2-jax">Jax version</a>. I will also eventually extend this to <a href="https://github.com/lucidrains/dalle2-video">text to video</a>, once the repository is in a good place.

 ## Install

@@ -22,19 +20,11 @@ For all of you emailing me (there is a lot), the best way to contribute is throu
 $ pip install dalle2-pytorch
 ```

-## CLI Usage (work in progress)
-
-```bash
-$ dream 'sharing a sunset at the summit of mount everest with my dog'
-```
-
-Once built, images will be saved to the same directory the command is invoked
-
-## Training (for deep learning practitioners)
+## Usage

 To train DALLE-2 is a 3 step process, with the training of CLIP being the most important

-To train CLIP, you can either use <a href="https://github.com/lucidrains/x-clip">x-clip</a> package, or join the LAION discord, where a lot of replication efforts are already underway.
+To train CLIP, you can either use <a href="https://github.com/lucidrains/x-clip">x-clip</a> package, or join the LAION discord, where a lot of replication efforts are already <a href="https://github.com/mlfoundations/open_clip">underway</a>.

 This repository will demonstrate integration with `x-clip` for starters

@@ -57,7 +47,7 @@ clip = CLIP(
    use_all_token_embeds = True,            # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
-    use_visual_ssl = True,                  # whether to do self supervised learning on iages
+    use_visual_ssl = True,                  # whether to do self supervised learning on images
    visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
    use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05,            # weight for text MLM loss
@@ -109,7 +99,7 @@ clip = CLIP(
 unet = Unet(
    dim = 128,
    image_embed_dim = 512,
-    time_dim = 128,
+    cond_dim = 128,
    channels = 3,
    dim_mults=(1, 2, 4, 8)
 ).cuda()
@@ -117,10 +107,11 @@ unet = Unet(
 # decoder, which contains the unet and clip

 decoder = Decoder(
-    net = unet,
+    unet = unet,
    clip = clip,
    timesteps = 100,
-    cond_drop_prob = 0.2
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5
 ).cuda()

 # mock images (get a lot of this)
@@ -162,7 +153,6 @@ clip = CLIP(

 prior_network = DiffusionPriorNetwork(
    dim = 512,
-    num_timesteps = 100,
    depth = 6,
    dim_head = 64,
    heads = 8
@@ -191,7 +181,76 @@ loss.backward()
 # now the diffusion prior can generate image embeddings from the text embeddings
 ```

-Finally, to generate the DALL-E2 images from text. Insert the trained `DiffusionPrior` as well as the `Decoder` (which both contains `CLIP`, a unet, and a causal transformer)
+In the paper, they actually used a <a href="https://cascaded-diffusion.github.io/">recently discovered technique</a>, from <a href="http://www.jonathanho.me/">Jonathan Ho</a> himself (original author of DDPMs, the core technique used in DALL-E v2) for high resolution image synthesis.
+
+This can easily be used within this framework as so
+
+```python
+import torch
+from dalle2_pytorch import Unet, Decoder, CLIP
+
+# trained clip from step 1
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 6,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 6,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8
+).cuda()
+
+# 2 unets for the decoder (a la cascading DDPM)
+
+unet1 = Unet(
+    dim = 32,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8)
+).cuda()
+
+unet2 = Unet(
+    dim = 32,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16)
+).cuda()
+
+# decoder, which contains the unet(s) and clip
+
+decoder = Decoder(
+    clip = clip,
+    unet = (unet1, unet2),            # insert both unets in order of low resolution to highest resolution (you can have as many stages as you want here)
+    image_sizes = (256, 512),         # resolutions, 256 for first unet, 512 for second. these must be unique and in ascending order (matches with the unets passed in)
+    timesteps = 1000,
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5
+).cuda()
+
+# mock images (get a lot of this)
+
+images = torch.randn(4, 3, 512, 512).cuda()
+
+# feed images into decoder, specifying which unet you want to train
+# each unet can be trained separately, which is one of the benefits of the cascading DDPM scheme
+
+loss = decoder(images, unet_number = 1)
+loss.backward()
+
+loss = decoder(images, unet_number = 2)
+loss.backward()
+
+# do the above for many steps for both unets
+```
+
+Finally, to generate the DALL-E2 images from text, insert the trained `DiffusionPrior` as well as the `Decoder` (which wraps `CLIP`, the causal transformer, and unet(s))

 ```python
 from dalle2_pytorch import DALLE2
@@ -251,7 +310,6 @@ loss.backward()

 prior_network = DiffusionPriorNetwork(
    dim = 512,
-    num_timesteps = 100,
    depth = 6,
    dim_head = 64,
    heads = 8
@@ -271,23 +329,35 @@ loss.backward()

 # decoder (with unet)

-unet = Unet(
+unet1 = Unet(
    dim = 128,
    image_embed_dim = 512,
-    time_dim = 128,
+    cond_dim = 128,
    channels = 3,
    dim_mults=(1, 2, 4, 8)
 ).cuda()

-decoder = Decoder(
-    net = unet,
-    clip = clip,
-    timesteps = 100,
-    cond_drop_prob = 0.2
+unet2 = Unet(
+    dim = 16,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16)
 ).cuda()

-loss = decoder(images)
-loss.backward()
+decoder = Decoder(
+    unet = (unet1, unet2),
+    image_sizes = (128, 256),
+    clip = clip,
+    timesteps = 100,
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5,
+    condition_on_text_encodings = False  # set this to True if you wish to condition on text during training and sampling
+).cuda()
+
+for unet_number in (1, 2):
+    loss = decoder(images, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
+    loss.backward()

 # do above for many steps

@@ -301,13 +371,431 @@ images = dalle2(
    cond_scale = 2. # classifier free guidance strength (> 1 would strengthen the condition)
 )

-# save your image
+# save your image (in this example, of size 256x256)
 ```

 Everything in this readme should run without error

+You can also train the decoder on images larger (say 512x512) than the size at which CLIP was trained (256x256). The images will be resized to the CLIP image resolution to compute the image embeddings
+
 For the layperson, no worries, training will all be automated into a CLI tool, at least for small scale training.

+## Training on Preprocessed CLIP Embeddings
+
+It is likely, when scaling up, that you would first preprocess your images and text into corresponding embeddings before training the prior network. You can do so easily by simply passing in `image_embed`, `text_embed`, and optionally `text_encodings` and `text_mask`
+
+Working example below
+
+```python
+import torch
+from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior, CLIP
+
+# get trained CLIP from step one
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 6,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 6,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8,
+).cuda()
+
+# setup prior network, which contains an autoregressive transformer
+
+prior_network = DiffusionPriorNetwork(
+    dim = 512,
+    depth = 6,
+    dim_head = 64,
+    heads = 8
+).cuda()
+
+# diffusion prior network, which contains the CLIP and network (with transformer) above
+
+diffusion_prior = DiffusionPrior(
+    net = prior_network,
+    clip = clip,
+    timesteps = 100,
+    cond_drop_prob = 0.2,
+    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
+).cuda()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# precompute the text and image embeddings
+# here using the diffusion prior class, but could be done with CLIP alone
+
+clip_image_embeds = diffusion_prior.clip.embed_image(images).image_embed
+clip_text_embeds = diffusion_prior.clip.embed_text(text).text_embed
+
+# feed text and images into diffusion prior network
+
+loss = diffusion_prior(
+    text_embed = clip_text_embeds,
+    image_embed = clip_image_embeds
+)
+
+loss.backward()
+
+# do the above for many many many steps
+# now the diffusion prior can generate image embeddings from the text embeddings
+```
+
+You can also completely go `CLIP`-less, in which case you will need to pass in the `image_embed_dim` into the `DiffusionPrior` on initialization
+
+```python
+import torch
+from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior
+
+# setup prior network, which contains an autoregressive transformer
+
+prior_network = DiffusionPriorNetwork(
+    dim = 512,
+    depth = 6,
+    dim_head = 64,
+    heads = 8
+).cuda()
+
+# diffusion prior, which wraps the network (with transformer) above - no CLIP is passed in here
+
+diffusion_prior = DiffusionPrior(
+    net = prior_network,
+    image_embed_dim = 512,               # this needs to be set
+    timesteps = 100,
+    cond_drop_prob = 0.2,
+    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
+).cuda()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# mock precomputed text and image embeddings
+# in practice these would be computed ahead of time with CLIP
+
+clip_image_embeds = torch.randn(4, 512).cuda()
+clip_text_embeds = torch.randn(4, 512).cuda()
+
+# feed text and images into diffusion prior network
+
+loss = diffusion_prior(
+    text_embed = clip_text_embeds,
+    image_embed = clip_image_embeds
+)
+
+loss.backward()
+
+# do the above for many many many steps
+# now the diffusion prior can generate image embeddings from the text embeddings
+```
+
+## OpenAI CLIP
+
+Although there is the possibility they are using an unreleased, more powerful CLIP, you can use one of the released ones, if you do not wish to train your own CLIP from scratch. This will also allow the community to more quickly validate the conclusions of the paper.
+
+To use a pretrained OpenAI CLIP, simply import `OpenAIClipAdapter` and pass it into the `DiffusionPrior` or `Decoder` like so
+
+```python
+import torch
+from dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder, OpenAIClipAdapter
+
+# openai pretrained clip - defaults to ViT-B/32
+
+clip = OpenAIClipAdapter()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# prior networks (with transformer)
+
+prior_network = DiffusionPriorNetwork(
+    dim = 512,
+    depth = 6,
+    dim_head = 64,
+    heads = 8
+).cuda()
+
+diffusion_prior = DiffusionPrior(
+    net = prior_network,
+    clip = clip,
+    timesteps = 100,
+    cond_drop_prob = 0.2
+).cuda()
+
+loss = diffusion_prior(text, images)
+loss.backward()
+
+# do above for many steps ...
+
+# decoder (with unet)
+
+unet1 = Unet(
+    dim = 128,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults=(1, 2, 4, 8)
+).cuda()
+
+unet2 = Unet(
+    dim = 16,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16)
+).cuda()
+
+decoder = Decoder(
+    unet = (unet1, unet2),
+    image_sizes = (128, 256),
+    clip = clip,
+    timesteps = 100,
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5,
+    condition_on_text_encodings = False  # set this to True if you wish to condition on text during training and sampling
+).cuda()
+
+for unet_number in (1, 2):
+    loss = decoder(images, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
+    loss.backward()
+
+# do above for many steps
+
+dalle2 = DALLE2(
+    prior = diffusion_prior,
+    decoder = decoder
+)
+
+images = dalle2(
+    ['a butterfly trying to escape a tornado'],
+    cond_scale = 2. # classifier free guidance strength (> 1 would strengthen the condition)
+)
+
+# save your image (in this example, of size 256x256)
+```
+
+Now you'll just have to worry about training the Prior and the Decoder!
+
+## Experimental
+
+### DALL-E2 with Latent Diffusion
+
+This repository decides to take the next step and offer DALL-E v2 combined with <a href="https://huggingface.co/spaces/multimodalart/latentdiffusion">latent diffusion</a>, from Rombach et al.
+
+You can use it as follows. Latent diffusion can be limited to just the first U-Net in the cascade, or to any number you wish.
+
+The repository also comes equipped with all the necessary settings to recreate `ViT-VQGan` from the <a href="https://arxiv.org/abs/2110.04627">Improved VQGans</a> paper. Furthermore, the <a href="https://github.com/lucidrains/vector-quantize-pytorch">vector quantization</a> library also comes equipped to do <a href="https://arxiv.org/abs/2203.01941">residual or multi-headed quantization</a>, which I believe will give an even further boost in performance to the autoencoder.
+
+```python
+import torch
+from dalle2_pytorch import Unet, Decoder, CLIP, VQGanVAE
+
+# trained clip from step 1
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 1,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 1,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8
+)
+
+# 3 unets for the decoder (a la cascading DDPM)
+
+# first two unets are doing latent diffusion
+# vqgan-vae must be trained beforehand
+
+vae1 = VQGanVAE(
+    dim = 32,
+    image_size = 256,
+    layers = 3,
+    layer_mults = (1, 2, 4)
+)
+
+vae2 = VQGanVAE(
+    dim = 32,
+    image_size = 512,
+    layers = 3,
+    layer_mults = (1, 2, 4)
+)
+
+unet1 = Unet(
+    dim = 32,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    sparse_attn = True,
+    sparse_attn_window = 2,
+    dim_mults = (1, 2, 4, 8)
+)
+
+unet2 = Unet(
+    dim = 32,
+    image_embed_dim = 512,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16),
+    cond_on_image_embeds = True,
+    cond_on_text_encodings = False
+)
+
+unet3 = Unet(
+    dim = 32,
+    image_embed_dim = 512,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16),
+    cond_on_image_embeds = True,
+    cond_on_text_encodings = False,
+    attend_at_middle = False
+)
+
+# decoder, which contains the unet(s) and clip
+
+decoder = Decoder(
+    clip = clip,
+    vae = (vae1, vae2),                # latent diffusion for unet1 (vae1) and unet2 (vae2), but not for the last unet3
+    unet = (unet1, unet2, unet3),      # insert unets in order of low resolution to highest resolution (you can have as many stages as you want here)
+    image_sizes = (256, 512, 1024),    # resolutions, 256 for first unet, 512 for second, 1024 for third
+    timesteps = 100,
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5
+).cuda()
+
+# mock images (get a lot of this)
+
+images = torch.randn(1, 3, 1024, 1024).cuda()
+
+# feed images into decoder, specifying which unet you want to train
+# each unet can be trained separately, which is one of the benefits of the cascading DDPM scheme
+
+with decoder.one_unet_in_gpu(1):
+    loss = decoder(images, unet_number = 1)
+    loss.backward()
+
+with decoder.one_unet_in_gpu(2):
+    loss = decoder(images, unet_number = 2)
+    loss.backward()
+
+with decoder.one_unet_in_gpu(3):
+    loss = decoder(images, unet_number = 3)
+    loss.backward()
+
+# do the above for many steps for all three unets
+
+# then it will learn to generate images based on the CLIP image embeddings
+
+# chaining the unets from lowest resolution to highest resolution (thus cascading)
+
+mock_image_embed = torch.randn(1, 512).cuda()
+images = decoder.sample(mock_image_embed) # (1, 3, 1024, 1024)
+```
+
+## Training wrapper (wip)
+
+### Decoder Training
+
+Training the `Decoder` may be confusing, as one needs to keep track of an optimizer for each of the `Unet`(s) separately. Each `Unet` will also need its own corresponding exponential moving average. The `DecoderTrainer` hopes to make this simple, as shown below
+
+```python
+import torch
+from dalle2_pytorch import DALLE2, Unet, Decoder, CLIP, DecoderTrainer
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 6,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 6,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8
+).cuda()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# decoder (with unet)
+
+unet1 = Unet(
+    dim = 128,
+    image_embed_dim = 512,
+    text_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults=(1, 2, 4, 8)
+).cuda()
+
+unet2 = Unet(
+    dim = 16,
+    image_embed_dim = 512,
+    text_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults = (1, 2, 4, 8, 16),
+    cond_on_text_encodings = True
+).cuda()
+
+decoder = Decoder(
+    unet = (unet1, unet2),
+    image_sizes = (128, 256),
+    clip = clip,
+    timesteps = 1000,
+    condition_on_text_encodings = True
+).cuda()
+
+decoder_trainer = DecoderTrainer(
+    decoder,
+    lr = 3e-4,
+    wd = 1e-2,
+    ema_beta = 0.99,
+    ema_update_after_step = 1000,
+    ema_update_every = 10,
+)
+
+for unet_number in (1, 2):
+    loss = decoder_trainer(images, text = text, unet_number = unet_number)  # use the decoder_trainer forward
+    loss.backward()
+
+    decoder_trainer.update(unet_number) # update the specific unet as well as its exponential moving average
+
+# after much training
+# you can sample from the exponentially moving averaged unets as so
+
+mock_image_embed = torch.randn(4, 512).cuda()
+images = decoder_trainer.sample(mock_image_embed, text = text) # (4, 3, 256, 256)
+```
+
+## CLI (wip)
+
+```bash
+$ dream 'sharing a sunset at the summit of mount everest with my dog'
+```
+
+Once built, images will be saved to the directory from which the command is invoked
+
+<a href="https://github.com/lucidrains/big-sleep">template</a>
+
 ## Training CLI (wip)

 <a href="https://github.com/lucidrains/stylegan2-pytorch">template</a>
@@ -317,12 +805,38 @@ For the layperson, no worries, training will all be automated into a CLI tool, a
 - [x] finish off gaussian diffusion class for latent embedding - allow for prediction of epsilon
 - [x] add what was proposed in the paper, where DDPM objective for image latent embedding predicts x0 directly (reread vq-diffusion paper and get caught up on that line of work)
 - [x] make sure it works end to end to produce an output tensor, taking a single gradient step
-- [ ] augment unet so that it can also be conditioned on text encodings (although in paper they hinted this didn't make much a difference)
-- [ ] look into Jonathan Ho's cascading DDPM for the decoder, as that seems to be what they are using. get caught up on DDPM literature
-- [ ] figure out all the current bag of tricks needed to make DDPMs great (starting with the blur trick mentioned in paper)
+- [x] augment unet so that it can also be conditioned on text encodings (although in paper they hinted this didn't make much a difference)
+- [x] figure out all the current bag of tricks needed to make DDPMs great (starting with the blur trick mentioned in paper)
+- [x] build the cascading ddpm by having Decoder class manage multiple unets at different resolutions
+- [x] add efficient attention in unet
+- [x] be able to finely customize what to condition on (text, image embed) for specific unet in the cascade (super resolution ddpms near the end may not need too much conditioning)
+- [x] offload unets not being trained on to CPU for memory efficiency (for training each resolution unets separately)
+- [x] build out latent diffusion architecture, with the vq-reg variant (vqgan-vae), make it completely optional and compatible with cascading ddpms
+- [x] for decoder, allow ability to customize objective (predict epsilon vs x0), in case latent diffusion does better with prediction of x0
+- [x] use attention-based upsampling https://arxiv.org/abs/2112.11435
+- [x] use inheritance just this once for sharing logic between decoder and prior network ddpms
+- [x] bring in vit-vqgan https://arxiv.org/abs/2110.04627 for the latent diffusion
+- [x] abstract interface for CLIP adapter class, so other CLIPs can be brought in
+- [x] take care of mixed precision as well as gradient accumulation within decoder trainer
+- [x] just take care of the training for the decoder in a wrapper class, as each unet in the cascade will need its own optimizer
+- [x] bring in tools to train vqgan-vae
+- [x] add convnext backbone for vqgan-vae (in addition to vit [vit-vqgan] + resnet)
+- [x] make sure DDPMs can be run with traditional resnet blocks (but leave convnext as an option for experimentation)
+- [x] make sure for the latter unets in the cascade, one can train on crops for learning super resolution (constrain the unet to be only convolutions in that case, or allow conv-like attention with rel pos bias)
+- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo)
+- [ ] copy the cascading ddpm code to a separate repo (perhaps https://github.com/lucidrains/denoising-diffusion-pytorch) as the main contribution of dalle2 really is just the prior network
+- [ ] transcribe code to Jax, which lowers the activation energy for distributed training, given access to TPUs
+- [ ] pull logic for training diffusion prior into a class DiffusionPriorTrainer, for eventual script based + CLI based training
 - [ ] train on a toy task, offer in colab
-- [ ] add attention to unet - apply some personal tricks with efficient attention
-- [ ] figure out the big idea behind latent diffusion and what can be ported over
+- [ ] think about how best to design a declarative training config that handles preencoding for prior and training of multiple networks in decoder
+- [ ] extend diffusion head to use diffusion-gan (potentially using lightweight-gan) to speed up inference
+- [ ] bring in cross-scale embedding from iclr paper https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/crossformer.py#L14
+- [ ] figure out if possible to augment with external memory, as described in https://arxiv.org/abs/2204.11824
+- [ ] test out grid attention in cascading ddpm locally, decide whether to keep or remove
+- [ ] use an experiment tracker agnostic setup, as done <a href="https://github.com/lucidrains/tf-bind-transformer#simple-trainer-class-for-fine-tuning">here</a>
+- [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2
+- [ ] make sure FILIP works with DALL-E2 from x-clip https://arxiv.org/abs/2111.07783
+- [ ] make sure resnet | convnext block hyperparameters can be configurable across unet depth (groups and expansion factor)

 ## Citations

@@ -361,14 +875,41 @@ For the layperson, no worries, training will all be automated into a CLI tool, a
 ```

 ```bibtex
-@misc{zhang2019root,
-    title   = {Root Mean Square Layer Normalization},
-    author  = {Biao Zhang and Rico Sennrich},
-    year    = {2019},
-    eprint  = {1910.07467},
-    archivePrefix = {arXiv},
-    primaryClass = {cs.LG}
+@article{shen2019efficient,
+    author  = {Zhuoran Shen and Mingyuan Zhang and Haiyu Zhao and Shuai Yi and Hongsheng Li},
+    title   = {Efficient Attention: Attention with Linear Complexities},
+    journal = {CoRR},
+    year    = {2018},
+    url     = {http://arxiv.org/abs/1812.01243},
 }
 ```

-*Creating noise from data is easy; creating data from noise is generative modeling.* - Yang Song's <a href="https://arxiv.org/abs/2011.13456">paper</a>
+```bibtex
+@inproceedings{Tu2022MaxViTMV,
+    title   = {MaxViT: Multi-Axis Vision Transformer},
+    author  = {Zhengzhong Tu and Hossein Talebi and Han Zhang and Feng Yang and Peyman Milanfar and Alan Conrad Bovik and Yinxiao Li},
+    year    = {2022}
+}
+```
+
+```bibtex
+@article{Yu2021VectorquantizedIM,
+    title   = {Vector-quantized Image Modeling with Improved VQGAN},
+    author  = {Jiahui Yu and Xin Li and Jing Yu Koh and Han Zhang and Ruoming Pang and James Qin and Alexander Ku and Yuanzhong Xu and Jason Baldridge and Yonghui Wu},
+    journal = {ArXiv},
+    year    = {2021},
+    volume  = {abs/2110.04627}
+}
+```
+
+```bibtex
+@article{Shleifer2021NormFormerIT,
+    title   = {NormFormer: Improved Transformer Pretraining with Extra Normalization},
+    author  = {Sam Shleifer and Jason Weston and Myle Ott},
+    journal = {ArXiv},
+    year    = {2021},
+    volume  = {abs/2110.09456}
+}
+```
+
+*Creating noise from data is easy; creating data from noise is generative modeling.* - <a href="https://arxiv.org/abs/2011.13456">Yang Song's paper</a>
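The Decoder Training section of the README above notes that each `Unet` needs its own exponential moving average, configured through `ema_beta`, `ema_update_after_step` and `ema_update_every`. As a rough illustration of how those three settings interact, here is a scalar sketch in plain Python. The class `ScalarEMA` is hypothetical and tracks a single float rather than model parameters; the schedule follows the `EMA.update` method added in `train.py` in this diff.

```python
class ScalarEMA:
    # hypothetical scalar stand-in for the EMA wrapper, for illustration only
    def __init__(self, value, beta = 0.99, update_after_step = 1000, update_every = 10):
        self.beta = beta
        self.update_after_step = update_after_step
        self.update_every = update_every
        self.online = value   # the value being trained
        self.ema = value      # the smoothed copy
        self.initted = False
        self.step = 0

    def update(self):
        self.step += 1

        # skip during warmup, and on steps that are not a multiple of update_every
        if self.step <= self.update_after_step or self.step % self.update_every != 0:
            return

        # on the first real update, start the average from the current online value
        if not self.initted:
            self.ema = self.online
            self.initted = True

        self.ema = self.ema * self.beta + (1 - self.beta) * self.online

tracker = ScalarEMA(0.0, beta = 0.5, update_after_step = 2, update_every = 2)

history = []
for online_value in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0):
    tracker.online = online_value
    tracker.update()
    history.append(tracker.ema)

print(history)  # [0.0, 0.0, 0.0, 4.0, 4.0, 5.0]
```

The average stays at its initial value through the warmup steps, snaps to the online value at the first eligible step (step 4 here), and only then starts blending.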
--- a/dalle2_pytorch/__init__.py
+++ b/dalle2_pytorch/__init__.py
@@ -1,2 +1,6 @@
 from dalle2_pytorch.dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder
+from dalle2_pytorch.dalle2_pytorch import OpenAIClipAdapter
+from dalle2_pytorch.train import DecoderTrainer
+
+from dalle2_pytorch.vqgan_vae import VQGanVAE
 from x_clip import CLIP
--- a/dalle2_pytorch/cli.py
+++ b/dalle2_pytorch/cli.py
@@ -1,9 +1,52 @@
 import click
+import torch
+import torchvision.transforms as T
+from functools import reduce
+from pathlib import Path
+
+from dalle2_pytorch import DALLE2, Decoder, DiffusionPrior
+
+def safeget(dictionary, keys, default = None):
+    return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split('.'), dictionary)
+
+def simple_slugify(text, max_length = 255):
+    return text.replace("-", "_").replace(",", "").replace(" ", "_").replace("|", "--").strip('-_')[:max_length]
+
+def get_pkg_version():
+    from pkg_resources import get_distribution
+    return get_distribution('dalle2_pytorch').version

 def main():
    pass

 @click.command()
+@click.option('--model', default = './dalle2.pt', help = 'path to trained DALL-E2 model')
+@click.option('--cond_scale', default = 2., type = float, help = 'conditioning scale (classifier free guidance) in decoder')
 @click.argument('text')
-def dream(text):
-    return image
+def dream(
+    model,
+    cond_scale,
+    text
+):
+    model_path = Path(model)
+    full_model_path = str(model_path.resolve())
+    assert model_path.exists(), f'model not found at {full_model_path}'
+    loaded = torch.load(str(model_path))
+
+    version = safeget(loaded, 'version')
+    print(f'loading DALL-E2 from {full_model_path}, saved at version {version} - current package version is {get_pkg_version()}')
+
+    prior_init_params = safeget(loaded, 'init_params.prior')
+    decoder_init_params = safeget(loaded, 'init_params.decoder')
+    model_params = safeget(loaded, 'model_params')
+
+    prior = DiffusionPrior(**prior_init_params)
+    decoder = Decoder(**decoder_init_params)
+
+    dalle2 = DALLE2(prior, decoder)
+    dalle2.load_state_dict(model_params)
+
+    image = dalle2(text, cond_scale = cond_scale)
+
+    pil_image = T.ToPILImage()(image)
+    return pil_image.save(f'./{simple_slugify(text)}.png')
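To make the two helpers in `cli.py` easier to follow, the snippet below copies `safeget` and `simple_slugify` from the diff above and runs them on small inputs. The `checkpoint` dictionary is a made-up stand-in for what `torch.load` might return; its keys mirror the ones the `dream` command reads, but the values are invented for illustration.

```python
from functools import reduce

def safeget(dictionary, keys, default = None):
    # walk a dotted key path, falling back to default as soon as a level is missing
    return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split('.'), dictionary)

def simple_slugify(text, max_length = 255):
    # turn a prompt into a filesystem friendly name
    return text.replace("-", "_").replace(",", "").replace(" ", "_").replace("|", "--").strip('-_')[:max_length]

# hypothetical checkpoint contents, not a real saved model
checkpoint = {
    'version': '0.0.1',
    'init_params': {'prior': {'timesteps': 100}}
}

prior_params = safeget(checkpoint, 'init_params.prior')      # nested lookup succeeds
decoder_params = safeget(checkpoint, 'init_params.decoder')  # missing leaf gives None
missing = safeget(checkpoint, 'model_params.weights', {})    # missing branch gives the default

filename = simple_slugify('a butterfly trying to escape a tornado') + '.png'
```

Note that a missing entry comes back as `None`, so a checkpoint saved without `init_params.decoder` would make `Decoder(**decoder_init_params)` fail with a `TypeError` rather than a clear message.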
--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
--- a/dalle2_pytorch/optimizer.py
+++ b/dalle2_pytorch/optimizer.py
@@ -0,0 +1,29 @@
+from torch.optim import AdamW, Adam
+
+def separate_weight_decayable_params(params):
+    no_wd_params = set([param for param in params if param.ndim < 2])
+    wd_params = set(params) - no_wd_params
+    return wd_params, no_wd_params
+
+def get_optimizer(
+    params,
+    lr = 3e-4,
+    wd = 1e-2,
+    betas = (0.9, 0.999),
+    filter_by_requires_grad = False
+):
+    if filter_by_requires_grad:
+        params = list(filter(lambda t: t.requires_grad, params))
+
+    if wd == 0:
+        return Adam(params, lr = lr, betas = betas)
+
+    params = set(params)
+    wd_params, no_wd_params = separate_weight_decayable_params(params)
+
+    param_groups = [
+        {'params': list(wd_params)},
+        {'params': list(no_wd_params), 'weight_decay': 0},
+    ]
+
+    return AdamW(param_groups, lr = lr, weight_decay = wd, betas = betas)
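The `get_optimizer` helper above exempts every parameter with fewer than two dimensions (biases, normalization gains) from weight decay. The following sketch reproduces that split on a toy model so the resulting parameter groups can be inspected. The model is arbitrary and chosen only for illustration, and lists are used instead of sets so that the group order is deterministic.

```python
import torch
from torch import nn
from torch.optim import AdamW

# toy model: two linear layers around a layernorm
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.LayerNorm(8),
    nn.Linear(8, 2)
)

params = list(model.parameters())

# same rule as separate_weight_decayable_params: 1d tensors get no weight decay
no_wd_params = [p for p in params if p.ndim < 2]
wd_params = [p for p in params if p.ndim >= 2]

optimizer = AdamW(
    [
        {'params': wd_params},
        {'params': no_wd_params, 'weight_decay': 0.},
    ],
    lr = 3e-4,
    weight_decay = 1e-2
)

for group in optimizer.param_groups:
    print(len(group['params']), group['weight_decay'])
```

Here the two linear weight matrices land in the decayed group, while the two linear biases and the layernorm weight and bias land in the group with `weight_decay` set to 0.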
--- a/dalle2_pytorch/train.py
+++ b/dalle2_pytorch/train.py
@@ -0,0 +1,199 @@
+import copy
+from functools import partial
+
+import torch
+from torch import nn
+from torch.cuda.amp import autocast, GradScaler
+
+from dalle2_pytorch.dalle2_pytorch import Decoder
+from dalle2_pytorch.optimizer import get_optimizer
+
+# helper functions
+
+def exists(val):
+    return val is not None
+
+def cast_tuple(val, length = 1):
+    return val if isinstance(val, tuple) else ((val,) * length)
+
+def pick_and_pop(keys, d):
+    values = list(map(lambda key: d.pop(key), keys))
+    return dict(zip(keys, values))
+
+def group_dict_by_key(cond, d):
+    return_val = [dict(),dict()]
+    for key in d.keys():
+        match = bool(cond(key))
+        ind = int(not match)
+        return_val[ind][key] = d[key]
+    return (*return_val,)
+
+def string_begins_with(prefix, str):
+    return str.startswith(prefix)
+
+def group_by_key_prefix(prefix, d):
+    return group_dict_by_key(partial(string_begins_with, prefix), d)
+
+def groupby_prefix_and_trim(prefix, d):
+    kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d)
+    kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
+    return kwargs_without_prefix, kwargs
+
+# exponential moving average wrapper
+
+class EMA(nn.Module):
+    def __init__(
+        self,
+        model,
+        beta = 0.99,
+        update_after_step = 1000,
+        update_every = 10,
+    ):
+        super().__init__()
+        self.beta = beta
+        self.online_model = model
+        self.ema_model = copy.deepcopy(model)
+
+        self.update_after_step = update_after_step # only start EMA after this step number, starting at 0
+        self.update_every = update_every
+
+        self.register_buffer('initted', torch.Tensor([False]))
+        self.register_buffer('step', torch.tensor([0.]))
+
+    def update(self):
+        self.step += 1
+
+        if self.step <= self.update_after_step or (self.step % self.update_every) != 0:
+            return
+
+        if not self.initted:
+            self.ema_model.load_state_dict(self.online_model.state_dict())
+            self.initted.data.copy_(torch.Tensor([True]))
+
+        self.update_moving_average(self.ema_model, self.online_model)
+
+    def update_moving_average(self, ma_model, current_model):
+        def calculate_ema(beta, old, new):
+            if not exists(old):
+                return new
+            return old * beta + (1 - beta) * new
+
+        for current_params, ma_params in zip(current_model.parameters(), ma_model.parameters()):
+            old_weight, up_weight = ma_params.data, current_params.data
+            ma_params.data = calculate_ema(self.beta, old_weight, up_weight)
+
+        for current_buffer, ma_buffer in zip(current_model.buffers(), ma_model.buffers()):
+            new_buffer_value = calculate_ema(self.beta, ma_buffer, current_buffer)
+            ma_buffer.copy_(new_buffer_value)
+
+    def __call__(self, *args, **kwargs):
+        return self.ema_model(*args, **kwargs)
+
+# trainers
+
+class DecoderTrainer(nn.Module):
+    def __init__(
+        self,
+        decoder,
+        use_ema = True,
+        lr = 3e-4,
+        wd = 1e-2,
+        max_grad_norm = None,
+        amp = False,
+        **kwargs
+    ):
+        super().__init__()
+        assert isinstance(decoder, Decoder)
+        ema_kwargs, kwargs = groupby_prefix_and_trim('ema_', kwargs)
+
+        self.decoder = decoder
+        self.num_unets = len(self.decoder.unets)
+
+        self.use_ema = use_ema
+
+        if use_ema:
+            has_lazy_linear = any([type(module) == nn.LazyLinear for module in decoder.modules()])
+            assert not has_lazy_linear, 'you must set the text_embed_dim on your u-nets if you plan on doing automatic exponential moving average'
+
+        self.ema_unets = nn.ModuleList([])
+
+        self.amp = amp
+
+        # be able to finely customize learning rate, weight decay
+        # per unet
+
+        lr, wd = map(partial(cast_tuple, length = self.num_unets), (lr, wd))
+
+        for ind, (unet, unet_lr, unet_wd) in enumerate(zip(self.decoder.unets, lr, wd)):
+            optimizer = get_optimizer(
+                unet.parameters(),
+                lr = unet_lr,
+                wd = unet_wd,
+                **kwargs
+            )
+
+            setattr(self, f'optim{ind}', optimizer) # cannot use pytorch ModuleList for some reason with optimizers
+
+            if self.use_ema:
+                self.ema_unets.append(EMA(unet, **ema_kwargs))
+
+            scaler = GradScaler(enabled = amp)
+            setattr(self, f'scaler{ind}', scaler)
+
+        # gradient clipping if needed
+
+        self.max_grad_norm = max_grad_norm
+
+    @property
+    def unets(self):
+        return nn.ModuleList([ema.ema_model for ema in self.ema_unets])
+
+    def scale(self, loss, *, unet_number):
+        assert 1 <= unet_number <= self.num_unets
+        index = unet_number - 1
+        scaler = getattr(self, f'scaler{index}')
+        return scaler.scale(loss)
+
+    def update(self, unet_number):
+        assert 1 <= unet_number <= self.num_unets
+        index = unet_number - 1
+        unet = self.decoder.unets[index]
+
+        optimizer = getattr(self, f'optim{index}')
+        scaler = getattr(self, f'scaler{index}')
+
+        if exists(self.max_grad_norm):
+            scaler.unscale_(optimizer)
+            nn.utils.clip_grad_norm_(unet.parameters(), self.max_grad_norm)
+
+        scaler.step(optimizer)
+        scaler.update()
+        optimizer.zero_grad()
+
+        if self.use_ema:
+            ema_unet = self.ema_unets[index]
+            ema_unet.update()
+
+    @torch.no_grad()
+    def sample(self, *args, **kwargs):
+        if self.use_ema:
+            trainable_unets = self.decoder.unets
+            self.decoder.unets = self.unets                  # swap in exponential moving averaged unets for sampling
+
+        output = self.decoder.sample(*args, **kwargs)
+
+        if self.use_ema:
+            self.decoder.unets = trainable_unets             # restore original training unets
+        return output
+
+    def forward(
+        self,
+        x,
+        *,
+        unet_number,
+        divisor = 1,
+        **kwargs
+    ):
+        with autocast(enabled = self.amp):
+            loss = self.decoder(x, unet_number = unet_number, **kwargs)
+        return self.scale(loss / divisor, unet_number = unet_number)
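Review note: a small pure-Python sketch of when `EMA.update` actually performs an averaging step and what the averaging rule does, using the default `update_after_step = 1000`, `update_every = 10` and `beta = 0.99` from the class above. The helper name `ema_step_fires` is hypothetical, and the scalar starts at `0.0` only to make the decay visible; the real wrapper first copies the online weights on the first firing step.

```python
def ema_step_fires(step, update_after_step = 1000, update_every = 10):
    # mirrors the early-return guard in EMA.update, where `step` is the counter after incrementing
    return step > update_after_step and step % update_every == 0

def calculate_ema(beta, old, new):
    # same rule as the inner helper in update_moving_average
    return old * beta + (1 - beta) * new

# steps (out of the first 1030 calls) on which an averaging update happens
firing = [step for step in range(1, 1031) if ema_step_fires(step)]

# scalar stand-in for one weight: the online value stays at 1.0
ema, online = 0.0, 1.0
for _ in firing:
    ema = calculate_ema(0.99, ema, online)
```

With these defaults the first 1000 calls are skipped entirely, and afterwards only every tenth call moves the averaged weights.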
--- a/dalle2_pytorch/train_vqgan_vae.py
+++ b/dalle2_pytorch/train_vqgan_vae.py
@@ -0,0 +1,266 @@
+from math import sqrt
+import copy
+from random import choice
+from pathlib import Path
+from shutil import rmtree
+
+import torch
+from torch import nn
+
+from PIL import Image
+from torchvision.datasets import ImageFolder
+import torchvision.transforms as T
+from torch.utils.data import Dataset, DataLoader, random_split
+from torchvision.utils import make_grid, save_image
+
+from einops import rearrange
+
+from dalle2_pytorch.train import EMA
+from dalle2_pytorch.vqgan_vae import VQGanVAE
+from dalle2_pytorch.optimizer import get_optimizer
+
+# helpers
+
+def exists(val):
+    return val is not None
+
+def noop(*args, **kwargs):
+    pass
+
+def cycle(dl):
+    while True:
+        for data in dl:
+            yield data
+
+def cast_tuple(t):
+    return t if isinstance(t, (tuple, list)) else (t,)
+
+def yes_or_no(question):
+    answer = input(f'{question} (y/n) ')
+    return answer.lower() in ('yes', 'y')
+
+def accum_log(log, new_logs):
+    for key, new_value in new_logs.items():
+        old_value = log.get(key, 0.)
+        log[key] = old_value + new_value
+    return log
+
+# classes
+
+class ImageDataset(Dataset):
+    def __init__(
+        self,
+        folder,
+        image_size,
+        exts = ['jpg', 'jpeg', 'png']
+    ):
+        super().__init__()
+        self.folder = folder
+        self.image_size = image_size
+        self.paths = [p for ext in exts for p in Path(f'{folder}').glob(f'**/*.{ext}')]
+
+        print(f'{len(self.paths)} training samples found at {folder}')
+
+        self.transform = T.Compose([
+            T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+            T.Resize(image_size),
+            T.RandomHorizontalFlip(),
+            T.CenterCrop(image_size),
+            T.ToTensor()
+        ])
+
+    def __len__(self):
+        return len(self.paths)
+
+    def __getitem__(self, index):
+        path = self.paths[index]
+        img = Image.open(path)
+        return self.transform(img)
+
+# main trainer class
+
+class VQGanVAETrainer(nn.Module):
+    def __init__(
+        self,
+        vae,
+        *,
+        num_train_steps,
+        lr,
+        batch_size,
+        folder,
+        grad_accum_every,
+        wd = 0.,
+        save_results_every = 100,
+        save_model_every = 1000,
+        results_folder = './results',
+        valid_frac = 0.05,
+        random_split_seed = 42,
+        ema_beta = 0.995,
+        ema_update_after_step = 2000,
+        ema_update_every = 10,
+        apply_grad_penalty_every = 4,
+    ):
+        super().__init__()
+        assert isinstance(vae, VQGanVAE), 'vae must be instance of VQGanVAE'
+        image_size = vae.image_size
+
+        self.vae = vae
+        self.ema_vae = EMA(vae, beta = ema_beta, update_after_step = ema_update_after_step, update_every = ema_update_every)
+
+        self.register_buffer('steps', torch.Tensor([0]))
+
+        self.num_train_steps = num_train_steps
+        self.batch_size = batch_size
+        self.grad_accum_every = grad_accum_every
+
+        all_parameters = set(vae.parameters())
+        discr_parameters = set(vae.discr.parameters())
+        vae_parameters = all_parameters - discr_parameters
+
+        self.optim = get_optimizer(vae_parameters, lr = lr, wd = wd)
+        self.discr_optim = get_optimizer(discr_parameters, lr = lr, wd = wd)
+
+        # create dataset
+
+        self.ds = ImageDataset(folder, image_size = image_size)
+
+        # split for validation
+
+        if valid_frac > 0:
+            train_size = int((1 - valid_frac) * len(self.ds))
+            valid_size = len(self.ds) - train_size
+            self.ds, self.valid_ds = random_split(self.ds, [train_size, valid_size], generator = torch.Generator().manual_seed(random_split_seed))
+            print(f'training with dataset of {len(self.ds)} samples and validating with randomly split {len(self.valid_ds)} samples')
+        else:
+            self.valid_ds = self.ds
+            print(f'training with shared training and valid dataset of {len(self.ds)} samples')
+
+        # dataloader
+
+        self.dl = cycle(DataLoader(
+            self.ds,
+            batch_size = batch_size,
+            shuffle = True
+        ))
+
+        self.valid_dl = cycle(DataLoader(
+            self.valid_ds,
+            batch_size = batch_size,
+            shuffle = True
+        ))
+
+        self.save_model_every = save_model_every
+        self.save_results_every = save_results_every
+
+        self.apply_grad_penalty_every = apply_grad_penalty_every
+
+        self.results_folder = Path(results_folder)
+
+        if len([*self.results_folder.glob('**/*')]) > 0 and yes_or_no('do you want to clear previous experiment checkpoints and results?'):
+            rmtree(str(self.results_folder))
+
+        self.results_folder.mkdir(parents = True, exist_ok = True)
+
+    def train_step(self):
+        device = next(self.vae.parameters()).device
+        steps = int(self.steps.item())
+        apply_grad_penalty = not (steps % self.apply_grad_penalty_every)
+
+        self.vae.train()
+
+        # logs
+
+        logs = {}
+
+        # update vae (generator)
+
+        for _ in range(self.grad_accum_every):
+            img = next(self.dl)
+            img = img.to(device)
+
+            loss = self.vae(
+                img,
+                return_loss = True,
+                apply_grad_penalty = apply_grad_penalty
+            )
+
+            accum_log(logs, {'loss': loss.item() / self.grad_accum_every})
+
+            (loss / self.grad_accum_every).backward()
+
+        self.optim.step()
+        self.optim.zero_grad()
+
+
+        # update discriminator
+
+        if exists(self.vae.discr):
+            discr_loss = 0
+            for _ in range(self.grad_accum_every):
+                img = next(self.dl)
+                img = img.to(device)
+
+                loss = self.vae(img, return_discr_loss = True)
+                accum_log(logs, {'discr_loss': loss.item() / self.grad_accum_every})
+
+                (loss / self.grad_accum_every).backward()
+
+            self.discr_optim.step()
+            self.discr_optim.zero_grad()
+
+            # log
+
+            print(f"{steps}: vae loss: {logs['loss']} - discr loss: {logs['discr_loss']}")
+
+        # update exponential moving averaged generator
+
+        self.ema_vae.update()
+
+        # sample results every so often
+
+        if not (steps % self.save_results_every):
+            for model, filename in ((self.ema_vae.ema_model, f'{steps}.ema'), (self.vae, str(steps))):
+                model.eval()
+
+                imgs = next(self.valid_dl)
+                imgs = imgs.to(device)
+
+                recons = model(imgs)
+                nrows = int(sqrt(self.batch_size))
+
+                imgs_and_recons = torch.stack((imgs, recons), dim = 0)
+                imgs_and_recons = rearrange(imgs_and_recons, 'r b ... -> (b r) ...')
+
+                imgs_and_recons = imgs_and_recons.detach().cpu().float().clamp(0., 1.)
+                grid = make_grid(imgs_and_recons, nrow = 2, normalize = True, value_range = (0, 1))
+
+                logs['reconstructions'] = grid
+
+                save_image(grid, str(self.results_folder / f'{filename}.png'))
+
+            print(f'{steps}: saving to {str(self.results_folder)}')
+
+        # save model every so often
+
+        if not (steps % self.save_model_every):
+            state_dict = self.vae.state_dict()
+            model_path = str(self.results_folder / f'vae.{steps}.pt')
+            torch.save(state_dict, model_path)
+
+            ema_state_dict = self.ema_vae.state_dict()
+            model_path = str(self.results_folder / f'vae.{steps}.ema.pt')
+            torch.save(ema_state_dict, model_path)
+
+            print(f'{steps}: saving model to {str(self.results_folder)}')
+
+        self.steps += 1
+        return logs
+
+    def train(self, log_fn = noop):
+        device = next(self.vae.parameters()).device
+
+        while self.steps < self.num_train_steps:
+            logs = self.train_step()
+            log_fn(logs)
+
+        print('training complete')
--- a/dalle2_pytorch/vqgan_vae.py
+++ b/dalle2_pytorch/vqgan_vae.py
@@ -0,0 +1,873 @@
+import copy
+import math
+from math import sqrt
+from functools import partial, wraps
+
+from vector_quantize_pytorch import VectorQuantize as VQ
+
+import torch
+from torch import nn, einsum
+import torch.nn.functional as F
+from torch.autograd import grad as torch_grad
+import torchvision
+
+from einops import rearrange, reduce, repeat
+from einops_exts import rearrange_many
+from einops.layers.torch import Rearrange
+
+# constants
+
+MList = nn.ModuleList
+
+# helper functions
+
+def exists(val):
+    return val is not None
+
+def default(val, d):
+    return val if exists(val) else d
+
+# decorators
+
+def eval_decorator(fn):
+    def inner(model, *args, **kwargs):
+        was_training = model.training
+        model.eval()
+        out = fn(model, *args, **kwargs)
+        model.train(was_training)
+        return out
+    return inner
+
+def remove_vgg(fn):
+    @wraps(fn)
+    def inner(self, *args, **kwargs):
+        has_vgg = hasattr(self, 'vgg')
+        if has_vgg:
+            vgg = self.vgg
+            delattr(self, 'vgg')
+
+        out = fn(self, *args, **kwargs)
+
+        if has_vgg:
+            self.vgg = vgg
+
+        return out
+    return inner
+
+# keyword argument helpers
+
+def pick_and_pop(keys, d):
+    values = list(map(lambda key: d.pop(key), keys))
+    return dict(zip(keys, values))
+
+def group_dict_by_key(cond, d):
+    return_val = [dict(),dict()]
+    for key in d.keys():
+        match = bool(cond(key))
+        ind = int(not match)
+        return_val[ind][key] = d[key]
+    return (*return_val,)
+
+def string_begins_with(prefix, str):
+    return str.startswith(prefix)
+
+def group_by_key_prefix(prefix, d):
+    return group_dict_by_key(partial(string_begins_with, prefix), d)
+
+def groupby_prefix_and_trim(prefix, d):
+    kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d)
+    kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
+    return kwargs_without_prefix, kwargs
+
+# tensor helper functions
+
+def log(t, eps = 1e-10):
+    return torch.log(t + eps)
+
+def gradient_penalty(images, output, weight = 10):
+    batch_size = images.shape[0]
+    gradients = torch_grad(outputs = output, inputs = images,
+                           grad_outputs = torch.ones(output.size(), device = images.device),
+                           create_graph = True, retain_graph = True, only_inputs = True)[0]
+
+    gradients = rearrange(gradients, 'b ... -> b (...)')
+    return weight * ((gradients.norm(2, dim = 1) - 1) ** 2).mean()
+
+def l2norm(t):
+    return F.normalize(t, dim = -1)
+
+def leaky_relu(p = 0.1):
+    return nn.LeakyReLU(p)
+
+def stable_softmax(t, dim = -1, alpha = 32 ** 2):
+    t = t / alpha
+    t = t - torch.amax(t, dim = dim, keepdim = True).detach()
+    return (t * alpha).softmax(dim = dim)
+
+def safe_div(numer, denom, eps = 1e-8):
+    return numer / (denom + eps)
+
+# gan losses
+
+def hinge_discr_loss(fake, real):
+    return (F.relu(1 + fake) + F.relu(1 - real)).mean()
+
+def hinge_gen_loss(fake):
+    return -fake.mean()
+
+def bce_discr_loss(fake, real):
+    return (-log(1 - torch.sigmoid(fake)) - log(torch.sigmoid(real))).mean()
+
+def bce_gen_loss(fake):
+    return -log(torch.sigmoid(fake)).mean()
+
+def grad_layer_wrt_loss(loss, layer):
+    return torch_grad(
+        outputs = loss,
+        inputs = layer,
+        grad_outputs = torch.ones_like(loss),
+        retain_graph = True
+    )[0].detach()
+
+# vqgan vae
+
+class LayerNormChan(nn.Module):
+    def __init__(
+        self,
+        dim,
+        eps = 1e-5
+    ):
+        super().__init__()
+        self.eps = eps
+        self.gamma = nn.Parameter(torch.ones(1, dim, 1, 1))
+
+    def forward(self, x):
+        var = torch.var(x, dim = 1, unbiased = False, keepdim = True)
+        mean = torch.mean(x, dim = 1, keepdim = True)
+        return (x - mean) / (var + self.eps).sqrt() * self.gamma
+
+# discriminator
+
+class Discriminator(nn.Module):
+    def __init__(
+        self,
+        dims,
+        channels = 3,
+        groups = 16,
+        init_kernel_size = 5
+    ):
+        super().__init__()
+        dim_pairs = zip(dims[:-1], dims[1:])
+
+        self.layers = MList([nn.Sequential(nn.Conv2d(channels, dims[0], init_kernel_size, padding = init_kernel_size // 2), leaky_relu())])
+
+        for dim_in, dim_out in dim_pairs:
+            self.layers.append(nn.Sequential(
+                nn.Conv2d(dim_in, dim_out, 4, stride = 2, padding = 1),
+                nn.GroupNorm(groups, dim_out),
+                leaky_relu()
+            ))
+
+        dim = dims[-1]
+        self.to_logits = nn.Sequential( # return 5 x 5, for PatchGAN-esque training
+            nn.Conv2d(dim, dim, 1),
+            leaky_relu(),
+            nn.Conv2d(dim, 1, 4)
+        )
+
+    def forward(self, x):
+        for net in self.layers:
+            x = net(x)
+
+        return self.to_logits(x)
+
+# positional encoding
+
+class ContinuousPositionBias(nn.Module):
+    """ from https://arxiv.org/abs/2111.09883 """
+
+    def __init__(self, *, dim, heads, layers = 2):
+        super().__init__()
+        self.net = MList([])
+        self.net.append(nn.Sequential(nn.Linear(2, dim), leaky_relu()))
+
+        for _ in range(layers - 1):
+            self.net.append(nn.Sequential(nn.Linear(dim, dim), leaky_relu()))
+
+        self.net.append(nn.Linear(dim, heads))
+        self.register_buffer('rel_pos', None, persistent = False)
+
+    def forward(self, x):
+        n, device = x.shape[-1], x.device
+        fmap_size = int(sqrt(n))
+
+        if not exists(self.rel_pos):
+            pos = torch.arange(fmap_size, device = device)
+            grid = torch.stack(torch.meshgrid(pos, pos, indexing = 'ij'))
+            grid = rearrange(grid, 'c i j -> (i j) c')
+            rel_pos = rearrange(grid, 'i c -> i 1 c') - rearrange(grid, 'j c -> 1 j c')
+            rel_pos = torch.sign(rel_pos) * torch.log(rel_pos.abs() + 1)
+            self.register_buffer('rel_pos', rel_pos, persistent = False)
+
+        rel_pos = self.rel_pos.float()
+
+        for layer in self.net:
+            rel_pos = layer(rel_pos)
+
+        bias = rearrange(rel_pos, 'i j h -> h i j')
+        return x + bias
+
+# resnet encoder / decoder
+
+class ResnetEncDec(nn.Module):
+    def __init__(
+        self,
+        dim,
+        *,
+        channels = 3,
+        layers = 4,
+        layer_mults = None,
+        num_resnet_blocks = 1,
+        resnet_groups = 16,
+        first_conv_kernel_size = 5,
+        use_attn = True,
+        attn_dim_head = 64,
+        attn_heads = 8,
+        attn_dropout = 0.,
+    ):
+        super().__init__()
+        assert dim % resnet_groups == 0, f'dimension {dim} must be divisible by {resnet_groups} (groups for the groupnorm)'
+
+        self.layers = layers
+
+        self.encoders = MList([])
+        self.decoders = MList([])
+
+        layer_mults = default(layer_mults, list(map(lambda t: 2 ** t, range(layers))))
+        assert len(layer_mults) == layers, 'layer multipliers must be equal to designated number of layers'
+
+        layer_dims = [dim * mult for mult in layer_mults]
+        dims = (dim, *layer_dims)
+
+        self.encoded_dim = dims[-1]
+
+        dim_pairs = zip(dims[:-1], dims[1:])
+
+        append = lambda arr, t: arr.append(t)
+        prepend = lambda arr, t: arr.insert(0, t)
+
+        if not isinstance(num_resnet_blocks, tuple):
+            num_resnet_blocks = (*((0,) * (layers - 1)), num_resnet_blocks)
+
+        if not isinstance(use_attn, tuple):
+            use_attn = (*((False,) * (layers - 1)), use_attn)
+
+        assert len(num_resnet_blocks) == layers, 'number of resnet blocks config must be equal to number of layers'
+        assert len(use_attn) == layers
+
+        for layer_index, (dim_in, dim_out), layer_num_resnet_blocks, layer_use_attn in zip(range(layers), dim_pairs, num_resnet_blocks, use_attn):
+            append(self.encoders, nn.Sequential(nn.Conv2d(dim_in, dim_out, 4, stride = 2, padding = 1), leaky_relu()))
+            prepend(self.decoders, nn.Sequential(nn.ConvTranspose2d(dim_out, dim_in, 4, 2, 1), leaky_relu()))
+
+            if layer_use_attn:
+                prepend(self.decoders, VQGanAttention(dim = dim_out, heads = attn_heads, dim_head = attn_dim_head, dropout = attn_dropout))
+
+            for _ in range(layer_num_resnet_blocks):
+                append(self.encoders, ResBlock(dim_out, groups = resnet_groups))
+                prepend(self.decoders, GLUResBlock(dim_out, groups = resnet_groups))
+
+            if layer_use_attn:
+                append(self.encoders, VQGanAttention(dim = dim_out, heads = attn_heads, dim_head = attn_dim_head, dropout = attn_dropout))
+
+        prepend(self.encoders, nn.Conv2d(channels, dim, first_conv_kernel_size, padding = first_conv_kernel_size // 2))
+        append(self.decoders, nn.Conv2d(dim, channels, 1))
+
+    def get_encoded_fmap_size(self, image_size):
+        return image_size // (2 ** self.layers)
+
+    @property
+    def last_dec_layer(self):
+        return self.decoders[-1].weight
+
+    def encode(self, x):
+        for enc in self.encoders:
+            x = enc(x)
+        return x
+
+    def decode(self, x):
+        for dec in self.decoders:
+            x = dec(x)
+        return x
+
+class GLUResBlock(nn.Module):
+    def __init__(self, chan, groups = 16):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Conv2d(chan, chan * 2, 3, padding = 1),
+            nn.GLU(dim = 1),
+            nn.GroupNorm(groups, chan),
+            nn.Conv2d(chan, chan * 2, 3, padding = 1),
+            nn.GLU(dim = 1),
+            nn.GroupNorm(groups, chan),
+            nn.Conv2d(chan, chan, 1)
+        )
+
+    def forward(self, x):
+        return self.net(x) + x
+
+class ResBlock(nn.Module):
+    def __init__(self, chan, groups = 16):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Conv2d(chan, chan, 3, padding = 1),
+            nn.GroupNorm(groups, chan),
+            leaky_relu(),
+            nn.Conv2d(chan, chan, 3, padding = 1),
+            nn.GroupNorm(groups, chan),
+            leaky_relu(),
+            nn.Conv2d(chan, chan, 1)
+        )
+
+    def forward(self, x):
+        return self.net(x) + x
+
+# convnext enc dec
+
+class ChanLayerNorm(nn.Module):
+    def __init__(self, dim, eps = 1e-5):
+        super().__init__()
+        self.eps = eps
+        self.g = nn.Parameter(torch.ones(1, dim, 1, 1))
+
+    def forward(self, x):
+        var = torch.var(x, dim = 1, unbiased = False, keepdim = True)
+        mean = torch.mean(x, dim = 1, keepdim = True)
+        return (x - mean) / (var + self.eps).sqrt() * self.g
+
+class ConvNext(nn.Module):
+    def __init__(self, dim, mult = 4, kernel_size = 3, ds_kernel_size = 7):
+        super().__init__()
+        inner_dim = int(dim * mult)
+        self.net = nn.Sequential(
+            nn.Conv2d(dim, dim, ds_kernel_size, padding = ds_kernel_size // 2, groups = dim),
+            ChanLayerNorm(dim),
+            nn.Conv2d(dim, inner_dim, kernel_size, padding = kernel_size // 2),
+            nn.GELU(),
+            nn.Conv2d(inner_dim, dim, kernel_size, padding = kernel_size // 2)
+        )
+
+    def forward(self, x):
+        return self.net(x) + x
+
+class ConvNextEncDec(nn.Module):
+    def __init__(
+        self,
+        dim,
+        *,
+        channels = 3,
+        layers = 4,
+        layer_mults = None,
+        num_blocks = 1,
+        first_conv_kernel_size = 5,
+        use_attn = True,
+        attn_dim_head = 64,
+        attn_heads = 8,
+        attn_dropout = 0.,
+    ):
+        super().__init__()
+
+        self.layers = layers
+
+        self.encoders = MList([])
+        self.decoders = MList([])
+
+        layer_mults = default(layer_mults, list(map(lambda t: 2 ** t, range(layers))))
+        assert len(layer_mults) == layers, 'layer multipliers must be equal to designated number of layers'
+
+        layer_dims = [dim * mult for mult in layer_mults]
+        dims = (dim, *layer_dims)
+
+        self.encoded_dim = dims[-1]
+
+        dim_pairs = zip(dims[:-1], dims[1:])
+
+        append = lambda arr, t: arr.append(t)
+        prepend = lambda arr, t: arr.insert(0, t)
+
+        if not isinstance(num_blocks, tuple):
+            num_blocks = (*((0,) * (layers - 1)), num_blocks)
+
+        if not isinstance(use_attn, tuple):
+            use_attn = (*((False,) * (layers - 1)), use_attn)
+
+        assert len(num_blocks) == layers, 'number of blocks config must be equal to number of layers'
+        assert len(use_attn) == layers
+
+        for layer_index, (dim_in, dim_out), layer_num_blocks, layer_use_attn in zip(range(layers), dim_pairs, num_blocks, use_attn):
+            append(self.encoders, nn.Sequential(nn.Conv2d(dim_in, dim_out, 4, stride = 2, padding = 1), leaky_relu()))
+            prepend(self.decoders, nn.Sequential(nn.ConvTranspose2d(dim_out, dim_in, 4, 2, 1), leaky_relu()))
+
+            if layer_use_attn:
+                prepend(self.decoders, VQGanAttention(dim = dim_out, heads = attn_heads, dim_head = attn_dim_head, dropout = attn_dropout))
+
+            for _ in range(layer_num_blocks):
+                append(self.encoders, ConvNext(dim_out))
+                prepend(self.decoders, ConvNext(dim_out))
+
+            if layer_use_attn:
+                append(self.encoders, VQGanAttention(dim = dim_out, heads = attn_heads, dim_head = attn_dim_head, dropout = attn_dropout))
+
+        prepend(self.encoders, nn.Conv2d(channels, dim, first_conv_kernel_size, padding = first_conv_kernel_size // 2))
+        append(self.decoders, nn.Conv2d(dim, channels, 1))
+
+    def get_encoded_fmap_size(self, image_size):
+        return image_size // (2 ** self.layers)
+
+    @property
+    def last_dec_layer(self):
+        return self.decoders[-1].weight
+
+    def encode(self, x):
+        for enc in self.encoders:
+            x = enc(x)
+        return x
+
+    def decode(self, x):
+        for dec in self.decoders:
+            x = dec(x)
+        return x
+
+# vqgan attention layer
+
+class VQGanAttention(nn.Module):
+    def __init__(
+        self,
+        *,
+        dim,
+        dim_head = 64,
+        heads = 8,
+        dropout = 0.
+    ):
+        super().__init__()
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+        inner_dim = heads * dim_head
+
+        self.dropout = nn.Dropout(dropout)
+        self.pre_norm = LayerNormChan(dim)
+
+        self.cpb = ContinuousPositionBias(dim = dim // 4, heads = heads)
+        self.to_qkv = nn.Conv2d(dim, inner_dim * 3, 1, bias = False)
+        self.to_out = nn.Conv2d(inner_dim, dim, 1, bias = False)
+
+    def forward(self, x):
+        h = self.heads
+        height, width, residual = *x.shape[-2:], x.clone()
+
+        x = self.pre_norm(x)
+
+        q, k, v = self.to_qkv(x).chunk(3, dim = 1)
+
+        q, k, v = map(lambda t: rearrange(t, 'b (h c) x y -> b h c (x y)', h = h), (q, k, v))
+
+        sim = einsum('b h c i, b h c j -> b h i j', q, k) * self.scale
+
+        sim = self.cpb(sim)
+
+        attn = stable_softmax(sim, dim = -1)
+        attn = self.dropout(attn)
+
+        out = einsum('b h i j, b h c j -> b h c i', attn, v)
+        out = rearrange(out, 'b h c (x y) -> b (h c) x y', x = height, y = width)
+        out = self.to_out(out)
+
+        return out + residual
+
+# ViT encoder / decoder
+
+class RearrangeImage(nn.Module):
+    def forward(self, x):
+        n = x.shape[1]
+        w = h = int(sqrt(n))
+        return rearrange(x, 'b (h w) ... -> b h w ...', h = h, w = w)
+
+class Attention(nn.Module):
+    def __init__(
+        self,
+        dim,
+        *,
+        heads = 8,
+        dim_head = 32
+    ):
+        super().__init__()
+        self.norm = nn.LayerNorm(dim)
+        self.heads = heads
+        self.scale = dim_head ** -0.5
+        inner_dim = dim_head * heads
+
+        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
+        self.to_out = nn.Linear(inner_dim, dim)
+
+    def forward(self, x):
+        h = self.heads
+
+        x = self.norm(x)
+
+        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
+        q, k, v = rearrange_many((q, k, v), 'b n (h d) -> b h n d', h = h)
+
+        q = q * self.scale
+        sim = einsum('b h i d, b h j d -> b h i j', q, k)
+
+        sim = sim - sim.amax(dim = -1, keepdim = True).detach()
+        attn = sim.softmax(dim = -1)
+
+        out = einsum('b h i j, b h j d -> b h i d', attn, v)
+
+        out = rearrange(out, 'b h n d -> b n (h d)')
+        return self.to_out(out)
+
+def FeedForward(dim, mult = 4):
+    return nn.Sequential(
+        nn.LayerNorm(dim),
+        nn.Linear(dim, dim * mult, bias = False),
+        nn.GELU(),
+        nn.Linear(dim * mult, dim, bias = False)
+    )
+
+class Transformer(nn.Module):
+    def __init__(
+        self,
+        dim,
+        *,
+        layers,
+        dim_head = 32,
+        heads = 8,
+        ff_mult = 4
+    ):
+        super().__init__()
+        self.layers = nn.ModuleList([])
+        for _ in range(layers):
+            self.layers.append(nn.ModuleList([
+                Attention(dim = dim, dim_head = dim_head, heads = heads),
+                FeedForward(dim = dim, mult = ff_mult)
+            ]))
+
+        self.norm = nn.LayerNorm(dim)
+
+    def forward(self, x):
+        for attn, ff in self.layers:
+            x = attn(x) + x
+            x = ff(x) + x
+
+        return self.norm(x)
+
+class ViTEncDec(nn.Module):
+    def __init__(
+        self,
+        dim,
+        channels = 3,
+        layers = 4,
+        patch_size = 8,
+        dim_head = 32,
+        heads = 8,
+        ff_mult = 4
+    ):
+        super().__init__()
+        self.encoded_dim = dim
+        self.patch_size = patch_size
+
+        input_dim = channels * (patch_size ** 2)
+
+        self.encoder = nn.Sequential(
+            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
+            nn.Linear(input_dim, dim),
+            Transformer(
+                dim = dim,
+                dim_head = dim_head,
+                heads = heads,
+                ff_mult = ff_mult,
+                layers = layers
+            ),
+            RearrangeImage(),
+            Rearrange('b h w c -> b c h w')
+        )
+
+        self.decoder = nn.Sequential(
+            Rearrange('b c h w -> b (h w) c'),
+            Transformer(
+                dim = dim,
+                dim_head = dim_head,
+                heads = heads,
+                ff_mult = ff_mult,
+                layers = layers
+            ),
+            nn.Sequential(
+                nn.Linear(dim, dim * 4, bias = False),
+                nn.Tanh(),
+                nn.Linear(dim * 4, input_dim, bias = False),
+            ),
+            RearrangeImage(),
+            Rearrange('b h w (p1 p2 c) -> b c (h p1) (w p2)', p1 = patch_size, p2 = patch_size)
+        )
+
+    def get_encoded_fmap_size(self, image_size):
+        return image_size // self.patch_size
+
+    @property
+    def last_dec_layer(self):
+        return self.decoder[-3][-1].weight
+
+    def encode(self, x):
+        return self.encoder(x)
+
+    def decode(self, x):
+        return self.decoder(x)
+
+# main vqgan-vae classes
+
+class NullVQGanVAE(nn.Module):
+    def __init__(
+        self,
+        *,
+        channels
+    ):
+        super().__init__()
+        self.encoded_dim = channels
+        self.layers = 0
+
+    def get_encoded_fmap_size(self, size):
+        return size
+
+    def copy_for_eval(self):
+        return self
+
+    def encode(self, x):
+        return x
+
+    def decode(self, x):
+        return x
+
+class VQGanVAE(nn.Module):
+    def __init__(
+        self,
+        *,
+        dim,
+        image_size,
+        channels = 3,
+        layers = 4,
+        l2_recon_loss = False,
+        use_hinge_loss = True,
+        vgg = None,
+        vq_codebook_dim = 256,
+        vq_codebook_size = 512,
+        vq_decay = 0.8,
+        vq_commitment_weight = 1.,
+        vq_kmeans_init = True,
+        vq_use_cosine_sim = True,
+        use_vgg_and_gan = True,
+        vae_type = 'resnet',
+        discr_layers = 4,
+        **kwargs
+    ):
+        super().__init__()
+        vq_kwargs, kwargs = groupby_prefix_and_trim('vq_', kwargs)
+        encdec_kwargs, kwargs = groupby_prefix_and_trim('encdec_', kwargs)
+
+        self.image_size = image_size
+        self.channels = channels
+        self.codebook_size = vq_codebook_size
+
+        if vae_type == 'resnet':
+            enc_dec_klass = ResnetEncDec
+        elif vae_type == 'vit':
+            enc_dec_klass = ViTEncDec
+        elif vae_type == 'convnext':
+            enc_dec_klass = ConvNextEncDec
+        else:
+            raise ValueError(f'{vae_type} not valid')
+
+        self.enc_dec = enc_dec_klass(
+            dim = dim,
+            channels = channels,
+            layers = layers,
+            **encdec_kwargs
+        )
+
+        self.vq = VQ(
+            dim = self.enc_dec.encoded_dim,
+            codebook_dim = vq_codebook_dim,
+            codebook_size = vq_codebook_size,
+            decay = vq_decay,
+            commitment_weight = vq_commitment_weight,
+            accept_image_fmap = True,
+            kmeans_init = vq_kmeans_init,
+            use_cosine_sim = vq_use_cosine_sim,
+            **vq_kwargs
+        )
+
+        # reconstruction loss
+
+        self.recon_loss_fn = F.mse_loss if l2_recon_loss else F.l1_loss
+
+        # turn off GAN and perceptual loss if grayscale
+
+        self.vgg = None
+        self.discr = None
+        self.use_vgg_and_gan = use_vgg_and_gan
+
+        if not use_vgg_and_gan:
+            return
+
+        # perceptual loss
+
+        if exists(vgg):
+            self.vgg = vgg
+        else:
+            self.vgg = torchvision.models.vgg16(pretrained = True)
+            self.vgg.classifier = nn.Sequential(*self.vgg.classifier[:-2])
+
+        # gan related losses
+
+        layer_mults = list(map(lambda t: 2 ** t, range(discr_layers)))
+        layer_dims = [dim * mult for mult in layer_mults]
+        dims = (dim, *layer_dims)
+
+        self.discr = Discriminator(dims = dims, channels = channels)
+
+        self.discr_loss = hinge_discr_loss if use_hinge_loss else bce_discr_loss
+        self.gen_loss = hinge_gen_loss if use_hinge_loss else bce_gen_loss
+
+    @property
+    def encoded_dim(self):
+        return self.enc_dec.encoded_dim
+
+    def get_encoded_fmap_size(self, image_size):
+        return self.enc_dec.get_encoded_fmap_size(image_size)
+
+    def copy_for_eval(self):
+        device = next(self.parameters()).device
+        vae_copy = copy.deepcopy(self.cpu())
+
+        if vae_copy.use_vgg_and_gan:
+            del vae_copy.discr
+            del vae_copy.vgg
+
+        vae_copy.eval()
+        return vae_copy.to(device)
+
+    @remove_vgg
+    def state_dict(self, *args, **kwargs):
+        return super().state_dict(*args, **kwargs)
+
+    @remove_vgg
+    def load_state_dict(self, *args, **kwargs):
+        return super().load_state_dict(*args, **kwargs)
+
+    @property
+    def codebook(self):
+        return self.vq.codebook
+
+    def encode(self, fmap):
+        fmap = self.enc_dec.encode(fmap)
+        return fmap
+
+    def decode(self, fmap, return_indices_and_loss = False):
+        fmap, indices, commit_loss = self.vq(fmap)
+
+        fmap = self.enc_dec.decode(fmap)
+
+        if not return_indices_and_loss:
+            return fmap
+
+        return fmap, indices, commit_loss
+
+    def forward(
+        self,
+        img,
+        return_loss = False,
+        return_discr_loss = False,
+        return_recons = False,
+        add_gradient_penalty = True
+    ):
+        batch, channels, height, width, device = *img.shape, img.device
+        assert height == self.image_size and width == self.image_size, f'height and width of input image must be equal to {self.image_size}'
+        assert channels == self.channels, 'number of channels on image or sketch is not equal to the channels set on this VQGanVAE'
+
+        fmap = self.encode(img)
+
+        fmap, indices, commit_loss = self.decode(fmap, return_indices_and_loss = True)
+
+        if not return_loss and not return_discr_loss:
+            return fmap
+
+        assert return_loss ^ return_discr_loss, 'you should either return autoencoder loss or discriminator loss, but not both'
+
+        # whether to return discriminator loss
+
+        if return_discr_loss:
+            assert exists(self.discr), 'discriminator must exist to train it'
+
+            fmap.detach_()
+            img.requires_grad_()
+
+            fmap_discr_logits, img_discr_logits = map(self.discr, (fmap, img))
+
+            discr_loss = self.discr_loss(fmap_discr_logits, img_discr_logits)
+
+            loss = discr_loss
+            if add_gradient_penalty:
+                loss = loss + gradient_penalty(img, img_discr_logits)
+
+            if return_recons:
+                return loss, fmap
+
+            return loss
+
+        # reconstruction loss
+
+        recon_loss = self.recon_loss_fn(fmap, img)
+
+        # early return if training on grayscale
+
+        if not self.use_vgg_and_gan:
+            if return_recons:
+                return recon_loss, fmap
+
+            return recon_loss
+
+        # perceptual loss
+
+        img_vgg_input = img
+        fmap_vgg_input = fmap
+
+        if img.shape[1] == 1:
+            # handle grayscale for vgg
+            img_vgg_input, fmap_vgg_input = map(lambda t: repeat(t, 'b 1 ... -> b c ...', c = 3), (img_vgg_input, fmap_vgg_input))
+
+        img_vgg_feats = self.vgg(img_vgg_input)
+        recon_vgg_feats = self.vgg(fmap_vgg_input)
+        perceptual_loss = F.mse_loss(img_vgg_feats, recon_vgg_feats)
+
+        # generator loss
+
+        gen_loss = self.gen_loss(self.discr(fmap))
+
+        # calculate adaptive weight
+
+        last_dec_layer = self.enc_dec.last_dec_layer
+
+        norm_grad_wrt_gen_loss = grad_layer_wrt_loss(gen_loss, last_dec_layer).norm(p = 2)
+        norm_grad_wrt_perceptual_loss = grad_layer_wrt_loss(perceptual_loss, last_dec_layer).norm(p = 2)
+
+        adaptive_weight = safe_div(norm_grad_wrt_perceptual_loss, norm_grad_wrt_gen_loss)
+        adaptive_weight.clamp_(max = 1e4)
+
+        # combine losses
+
+        loss = recon_loss + perceptual_loss + commit_loss + adaptive_weight * gen_loss
+
+        if return_recons:
+            return loss, fmap
+
+        return loss
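For reviewers, here is a minimal sketch of the adaptive-weight arithmetic at the end of `VQGanVAE.forward`, using plain floats instead of tensors. The `safe_div` helper is not shown in this diff, so the version below (a small epsilon added to the denominator) is an assumption, and the gradient norms and loss values are made up for illustration.

```python
# Plain-float illustration of how the generator loss is re-weighted.
# Assumption: safe_div avoids division by zero by adding a small epsilon.

def safe_div(numer, denom, eps=1e-8):
    return numer / (denom + eps)

def adaptive_weight(norm_grad_perceptual, norm_grad_gen, max_weight=1e4):
    # ratio of the two gradient norms at the last decoder layer, clamped from above
    return min(safe_div(norm_grad_perceptual, norm_grad_gen), max_weight)

def combined_loss(recon_loss, perceptual_loss, commit_loss, gen_loss, weight):
    # same sum as the final loss in forward
    return recon_loss + perceptual_loss + commit_loss + weight * gen_loss

# hypothetical gradient norms and loss values
w = adaptive_weight(norm_grad_perceptual=0.5, norm_grad_gen=2.0)
w_clamped = adaptive_weight(norm_grad_perceptual=1.0, norm_grad_gen=0.0)
total = combined_loss(0.1, 0.2, 0.05, gen_loss=0.4, weight=w)
```

When the generator gradient is much smaller than the perceptual gradient, the ratio grows without bound, which is presumably why the weight is clamped at `1e4`.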
--- a/setup.py
+++ b/setup.py
@@ -10,7 +10,7 @@ setup(
      'dream = dalle2_pytorch.cli:dream'
    ],
  },
-  version = '0.0.7',
+  version = '0.0.98',
  license='MIT',
  description = 'DALL-E 2',
  author = 'Phil Wang',
@@ -23,13 +23,18 @@ setup(
  ],
  install_requires=[
    'click',
+    'clip-anytorch',
    'einops>=0.4',
    'einops-exts>=0.0.3',
+    'embedding-reader',
+    'kornia>=0.5.4',
    'pillow',
    'torch>=1.10',
    'torchvision',
    'tqdm',
-    'x-clip>=0.4.4',
+    'vector-quantize-pytorch',
+    'webdataset',
+    'x-clip>=0.5.1',
    'youtokentome'
  ],
  classifiers=[
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -0,0 +1,291 @@
+import os
+import math
+import argparse
+import numpy as np
+
+import torch
+from torch import nn
+from embedding_reader import EmbeddingReader
+from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork
+from dalle2_pytorch.optimizer import get_optimizer
+from torch.cuda.amp import autocast,GradScaler
+
+import time
+from tqdm import tqdm
+
+import wandb
+os.environ["WANDB_SILENT"] = "true"
+NUM_TEST_EMBEDDINGS = 100 # for cosine similarity reporting during training
+REPORT_METRICS_EVERY = 100 # for cosine similarity and other metric reporting during training
+
+
+def eval_model(model,device,image_reader,text_reader,start,end,batch_size,loss_type,phase="Validation"):
+    model.eval()
+    with torch.no_grad():
+        total_loss = 0.
+        total_samples = 0.
+
+        for emb_images, emb_text in zip(image_reader(batch_size=batch_size, start=start, end=end),
+                text_reader(batch_size=batch_size, start=start, end=end)):
+
+            emb_images_tensor = torch.tensor(emb_images[0]).to(device)
+            emb_text_tensor = torch.tensor(emb_text[0]).to(device)
+
+            batches = emb_images_tensor.shape[0]
+
+            loss = model(text_embed = emb_text_tensor, image_embed = emb_images_tensor)
+
+            total_loss += loss.item() * batches
+            total_samples += batches
+
+        avg_loss = (total_loss / total_samples)
+        wandb.log({f'{phase} {loss_type}': avg_loss})
+
+def save_model(save_path, state_dict):
+    # Saving State Dict
+    print("====================================== Saving checkpoint ======================================")
+    torch.save(state_dict, os.path.join(save_path, str(time.time())+'_saved_model.pth'))
+
+def report_cosine_sims(diffusion_prior,image_reader,text_reader,train_set_size,val_set_size,NUM_TEST_EMBEDDINGS,device):
+    cos = nn.CosineSimilarity(dim=1, eps=1e-6)
+
+    tstart = train_set_size+val_set_size
+    tend = train_set_size+val_set_size+NUM_TEST_EMBEDDINGS
+
+    for embt, embi in zip(text_reader(batch_size = NUM_TEST_EMBEDDINGS, start=tstart, end = tend),image_reader(batch_size = NUM_TEST_EMBEDDINGS, start=tstart, end = tend)):
+        text_embed = torch.tensor(embt[0]).to(device)
+        text_embed = text_embed /  text_embed.norm(dim=1, keepdim=True)
+        test_text_cond = dict(text_embed = text_embed)
+
+        test_image_embeddings = torch.tensor(embi[0]).to(device)
+        test_image_embeddings = test_image_embeddings /  test_image_embeddings.norm(dim=1, keepdim=True)
+
+        predicted_image_embeddings = diffusion_prior.p_sample_loop(test_image_embeddings.shape, text_cond = test_text_cond) # sample with the same (batch, dim) shape as the reference embeddings instead of a hard-coded 768
+        predicted_image_embeddings = predicted_image_embeddings / predicted_image_embeddings.norm(dim=1, keepdim=True)
+
+        original_similarity = cos(text_embed,test_image_embeddings).cpu().numpy()
+        predicted_similarity = cos(text_embed,predicted_image_embeddings).cpu().numpy()
+
+        wandb.log({"CosineSimilarity(text_embed,image_embed)": np.mean(original_similarity)})
+        wandb.log({"CosineSimilarity(text_embed,predicted_image_embed)":np.mean(predicted_similarity)})
+
+    return np.mean(predicted_similarity - original_similarity)
+
+
+
+def train(image_embed_dim,
+          image_embed_url,
+          text_embed_url,
+          batch_size,
+          train_percent,
+          val_percent,
+          test_percent,
+          num_epochs,
+          dp_loss_type,
+          clip,
+          dp_condition_on_text_encodings,
+          dp_timesteps,
+          dp_l2norm_output,
+          dp_normformer,
+          dp_cond_drop_prob,
+          dpn_depth,
+          dpn_dim_head,
+          dpn_heads,
+          save_interval,
+          save_path,
+          device,
+          learning_rate=0.001,
+          max_grad_norm=0.5,
+          weight_decay=0.01,
+          amp=False):
+
+    # DiffusionPriorNetwork 
+    prior_network = DiffusionPriorNetwork( 
+            dim = image_embed_dim, 
+            depth = dpn_depth, 
+            dim_head = dpn_dim_head, 
+            heads = dpn_heads,
+            normformer = dp_normformer,
+            l2norm_output = dp_l2norm_output).to(device)
+    
+    # DiffusionPrior with text embeddings and image embeddings pre-computed
+    diffusion_prior = DiffusionPrior( 
+            net = prior_network, 
+            clip = clip, 
+            image_embed_dim = image_embed_dim, 
+            timesteps = dp_timesteps,
+            cond_drop_prob = dp_cond_drop_prob, 
+            loss_type = dp_loss_type, 
+            condition_on_text_encodings = dp_condition_on_text_encodings).to(device)
+
+    # Get image and text embeddings from the servers
+    print("==============Downloading embeddings - image and text====================")
+    image_reader = EmbeddingReader(embeddings_folder=image_embed_url, file_format="npy")
+    text_reader  = EmbeddingReader(embeddings_folder=text_embed_url, file_format="npy")
+    num_data_points = text_reader.count
+
+    # Create save_path if it doesn't exist
+    if not os.path.exists(save_path):
+        os.makedirs(save_path)
+
+    ### Training code ###
+    scaler = GradScaler(enabled=amp)
+    optimizer = get_optimizer(diffusion_prior.net.parameters(), wd=weight_decay, lr=learning_rate)
+    epochs = num_epochs
+
+    step = 0
+    t = time.time()
+
+    train_set_size = int(train_percent*num_data_points)
+    val_set_size = int(val_percent*num_data_points)
+
+    for _ in range(epochs):
+        diffusion_prior.train()
+
+        for emb_images,emb_text in zip(image_reader(batch_size=batch_size, start=0, end=train_set_size),
+                text_reader(batch_size=batch_size, start=0, end=train_set_size)):
+            emb_images_tensor = torch.tensor(emb_images[0]).to(device)
+            emb_text_tensor = torch.tensor(emb_text[0]).to(device)
+
+            with autocast(enabled=amp):
+                loss = diffusion_prior(text_embed = emb_text_tensor,image_embed = emb_images_tensor)
+            scaler.scale(loss).backward() # backward runs outside the autocast context
+
+            # Samples per second
+            step+=1
+            samples_per_sec = batch_size*step/(time.time()-t)
+            # Save checkpoint every save_interval minutes
+            if(int(time.time()-t) >= 60*save_interval):
+                t = time.time()
+
+                save_model(
+                    save_path,
+                    dict(model=diffusion_prior.state_dict(), optimizer=optimizer.state_dict(), scaler=scaler.state_dict()))
+
+            # Log to wandb
+            wandb.log({"Training loss": loss.item(),
+                        "Steps": step,
+                        "Samples per second": samples_per_sec})
+            # Log cosineSim(text_embed,predicted_image_embed) - cosineSim(text_embed,image_embed)
+            # Use NUM_TEST_EMBEDDINGS samples from the test set each time
+            # Get embeddings from the most recently saved model
+            if(step % REPORT_METRICS_EVERY) == 0:
+                diff_cosine_sim = report_cosine_sims(diffusion_prior,
+                        image_reader,
+                        text_reader,
+                        train_set_size,
+                        val_set_size,
+                        NUM_TEST_EMBEDDINGS,
+                        device)
+                wandb.log({"Cosine similarity difference": diff_cosine_sim})
+
+            scaler.unscale_(optimizer)
+            nn.utils.clip_grad_norm_(diffusion_prior.parameters(), max_grad_norm)
+
+            scaler.step(optimizer)
+            scaler.update()
+            optimizer.zero_grad()
+
+        ### Evaluate model(validation run) ###
+        start = train_set_size
+        end=start+val_set_size
+        eval_model(diffusion_prior,device,image_reader,text_reader,start,end,batch_size,dp_loss_type,phase="Validation")
+
+    ### Test run ###
+    test_set_size = int(test_percent*num_data_points) # informational only; the test range below runs to the end of the data
+    start=train_set_size+val_set_size
+    end=num_data_points
+    eval_model(diffusion_prior,device,image_reader,text_reader,start,end,batch_size,dp_loss_type,phase="Test")
+
+def main():
+    parser = argparse.ArgumentParser()
+    # Logging
+    parser.add_argument("--wandb-entity", type=str, default="laion")
+    parser.add_argument("--wandb-project", type=str, default="diffusion-prior")
+    parser.add_argument("--wandb-name", type=str, default="laion-dprior")
+    parser.add_argument("--wandb-dataset", type=str, default="LAION-5B")
+    parser.add_argument("--wandb-arch", type=str, default="DiffusionPrior")
+    # URLs for embeddings 
+    parser.add_argument("--image-embed-url", type=str, default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/")
+    parser.add_argument("--text-embed-url", type=str, default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/")
+    # Hyperparameters
+    parser.add_argument("--learning-rate", type=float, default=1.1e-4)
+    parser.add_argument("--weight-decay", type=float, default=6.02e-2)
+    parser.add_argument("--max-grad-norm", type=float, default=0.5)
+    parser.add_argument("--batch-size", type=int, default=10**4)
+    parser.add_argument("--num-epochs", type=int, default=5)
+    # Image embed dimension
+    parser.add_argument("--image-embed-dim", type=int, default=768)
+    # Train-test split
+    parser.add_argument("--train-percent", type=float, default=0.7)
+    parser.add_argument("--val-percent", type=float, default=0.2)
+    parser.add_argument("--test-percent", type=float, default=0.1)
+    # LAION training(pre-computed embeddings)
+    # DiffusionPriorNetwork(dpn) parameters
+    parser.add_argument("--dpn-depth", type=int, default=6)
+    parser.add_argument("--dpn-dim-head", type=int, default=64)
+    parser.add_argument("--dpn-heads", type=int, default=8)
+    # DiffusionPrior(dp) parameters (boolean options are store_true flags, since type=bool treats any non-empty string as True)
+    parser.add_argument("--dp-condition-on-text-encodings", action="store_true")
+    parser.add_argument("--dp-timesteps", type=int, default=100)
+    parser.add_argument("--dp-l2norm-output", action="store_true")
+    parser.add_argument("--dp-normformer", action="store_true")
+    parser.add_argument("--dp-cond-drop-prob", type=float, default=0.1)
+    parser.add_argument("--dp-loss-type", type=str, default="l2")
+    parser.add_argument("--clip", type=str, default=None)
+    parser.add_argument("--amp", action="store_true")
+    # Model checkpointing interval(minutes)
+    parser.add_argument("--save-interval", type=int, default=30)
+    parser.add_argument("--save-path", type=str, default="./diffusion_prior_checkpoints")
+
+    args = parser.parse_args()
+
+    print("Setting up wandb logging... Please wait...")
+
+    wandb.init(
+      entity=args.wandb_entity,
+      project=args.wandb_project,
+      config={
+      "learning_rate": args.learning_rate,
+      "architecture": args.wandb_arch,
+      "dataset": args.wandb_dataset,
+      "epochs": args.num_epochs,
+      })
+
+    print("wandb logging setup done!")
+    # Obtain the utilized device.
+
+    has_cuda = torch.cuda.is_available()
+    device = torch.device("cuda:0" if has_cuda else "cpu") # fall back to CPU so `device` is always defined
+    if has_cuda:
+        torch.cuda.set_device(device)
+
+    # Training loop
+    train(args.image_embed_dim,
+          args.image_embed_url,
+          args.text_embed_url,
+          args.batch_size,
+          args.train_percent,
+          args.val_percent,
+          args.test_percent,
+          args.num_epochs,
+          args.dp_loss_type,
+          args.clip,
+          args.dp_condition_on_text_encodings,
+          args.dp_timesteps,
+          args.dp_l2norm_output,
+          args.dp_normformer,
+          args.dp_cond_drop_prob,
+          args.dpn_depth,
+          args.dpn_dim_head,
+          args.dpn_heads,
+          args.save_interval,
+          args.save_path,
+          device,
+          args.learning_rate,
+          args.max_grad_norm,
+          args.weight_decay,
+          args.amp)
+
+if __name__ == "__main__":
+  main()
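The train/validation/test index ranges used in `train()` can be sketched without any embeddings. `split_ranges` is a hypothetical helper, not part of the script. It mirrors how the script truncates the train and validation sizes with `int(...)` and lets the test range run to the end of the data.

```python
def split_ranges(num_data_points, train_percent, val_percent):
    # mirrors train(): sizes are truncated with int(), the test split takes the remainder
    train_set_size = int(train_percent * num_data_points)
    val_set_size = int(val_percent * num_data_points)
    train = (0, train_set_size)
    val = (train_set_size, train_set_size + val_set_size)
    test = (train_set_size + val_set_size, num_data_points)
    return train, val, test

# script defaults for the percentages, with a small hypothetical dataset size
train_range, val_range, test_range = split_ranges(1000, 0.7, 0.2)
```

Each tuple is a `(start, end)` pair in the form passed to the embedding readers. Because the test range is defined by what remains, `--test-percent` does not influence which samples are evaluated.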
Author	SHA1	Message	Date
Phil Wang	86e692d24f	fix random crop probability	2022-05-04 11:52:24 -07:00
Phil Wang	97b751209f	allow for last unet in the cascade to be trained on crops, if it is convolution-only	2022-05-04 11:48:48 -07:00
Phil Wang	74103fd8d6	product management	2022-05-04 11:20:50 -07:00
Phil Wang	1992d25cad	project management	2022-05-04 11:18:54 -07:00
Phil Wang	5b619c2fd5	make sure some hyperparameters for unet block is configurable	2022-05-04 11:18:32 -07:00
Phil Wang	9359ad2e91	0.0.95	2022-05-04 10:53:05 -07:00
Phil Wang	9ff228188b	offer old resnet blocks, from the original DDPM paper, just in case convnexts are unsuitable for generative work	2022-05-04 10:52:58 -07:00
Kumar R	2d9963d30e	Reporting metrics - Cosine similarity. (#55 ) * Update train_diffusion_prior.py * Delete train_diffusion_prior.py * Cosine similarity logging. * Update train_diffusion_prior.py * Report Cosine metrics every N steps.	2022-05-04 08:04:36 -07:00
Phil Wang	58d9b422f3	0.0.94	2022-05-04 07:42:33 -07:00
Ray Bell	44b319cb57	add missing import (#56 )	2022-05-04 07:42:20 -07:00
Phil Wang	c30f380689	final reminder	2022-05-03 08:18:53 -07:00
Phil Wang	e4e884bb8b	keep all doors open	2022-05-03 08:17:02 -07:00
Phil Wang	803ad9c17d	product management again	2022-05-03 08:15:25 -07:00
Phil Wang	a88dd6a9c0	todo	2022-05-03 08:09:02 -07:00
Kumar R	72c16b496e	Update train_diffusion_prior.py (#53 )	2022-05-02 22:44:57 -07:00
z	81d83dd7f2	defaults align with paper (#52 ) Co-authored-by: nousr <>	2022-05-02 13:52:11 -07:00
Phil Wang	fa66f7e1e9	todo	2022-05-02 12:57:15 -07:00
Phil Wang	aa8d135245	allow laion to experiment with normformer in diffusion prior	2022-05-02 11:35:00 -07:00
Phil Wang	70282de23b	add ability to turn on normformer settings, given @borisdayma reported good results and some personal anecdata	2022-05-02 11:33:15 -07:00
Phil Wang	83f761847e	todo	2022-05-02 10:52:39 -07:00
Phil Wang	11469dc0c6	makes more sense to keep this as True as default, for stability	2022-05-02 10:50:55 -07:00
Romain Beaumont	2d25c89f35	Fix passing of l2norm_output to DiffusionPriorNetwork (#51 )	2022-05-02 10:48:16 -07:00
Phil Wang	3fe96c208a	add ability to train diffusion prior with l2norm on output image embed	2022-05-02 09:53:20 -07:00
Phil Wang	0fc6c9cdf3	provide option to l2norm the output of the diffusion prior	2022-05-02 09:41:03 -07:00
Phil Wang	7ee0ecc388	mixed precision for training diffusion prior + save optimizer and scaler states	2022-05-02 09:31:04 -07:00
Phil Wang	1924c7cc3d	fix issue with mixed precision and gradient clipping	2022-05-02 09:20:19 -07:00
Phil Wang	f7df3caaf3	address not calculating average eval / test loss when training diffusion prior https://github.com/lucidrains/DALLE2-pytorch/issues/49	2022-05-02 08:51:41 -07:00
Phil Wang	fc954ee788	fix calculation of adaptive weight for vit-vqgan, thanks to @CiaoHe	2022-05-02 07:58:14 -07:00
Phil Wang	c1db2753f5	todo	2022-05-01 18:02:30 -07:00
Phil Wang	ad87bfe28f	switch to using linear attention for the sparse attention layers within unet, given success in GAN projects	2022-05-01 17:59:03 -07:00
Phil Wang	76c767b1ce	update deps, commit to using webdatasets, per @rom1504 consultation	2022-05-01 12:22:15 -07:00
Phil Wang	d991b8c39c	just clip the diffusion prior network parameters	2022-05-01 12:01:08 -07:00
Phil Wang	902693e271	todo	2022-05-01 11:57:08 -07:00
Phil Wang	35cd63982d	add gradient clipping, make sure weight decay is configurable, make sure learning rate is actually passed into get_optimizer, make sure model is set to training mode at beginning of each epoch	2022-05-01 11:55:38 -07:00
Kumar R	53ce6dfdf6	All changes implemented, current run happening. Link to wandb run in comments. (#43 ) * Train DiffusionPrior with pre-computed embeddings This is in response to https://github.com/lucidrains/DALLE2-pytorch/issues/29 - more metrics will get added.	2022-05-01 11:46:59 -07:00
Phil Wang	ad8d7a368b	product management	2022-05-01 11:26:21 -07:00
Phil Wang	b8cf1e5c20	more attention	2022-05-01 11:00:33 -07:00
Phil Wang	94aaa08d97	product management	2022-05-01 09:43:10 -07:00
Phil Wang	8b9bbec7d1	project management	2022-05-01 09:32:57 -07:00
Phil Wang	1bb9fc9829	add convnext backbone for vqgan-vae, still need to fix groupnorms in resnet encdec	2022-05-01 09:32:24 -07:00
Phil Wang	5e421bd5bb	let researchers do the hyperparameter search	2022-05-01 08:46:21 -07:00
Phil Wang	67fcab1122	add MLP based time conditioning to all convnexts, in addition to cross attention. also add an initial convolution, given convnext first depthwise conv	2022-05-01 08:41:02 -07:00
Phil Wang	5bfbccda22	port over vqgan vae trainer	2022-05-01 08:09:15 -07:00
Phil Wang	989275ff59	product management	2022-04-30 16:57:56 -07:00
Phil Wang	56408f4a40	project management	2022-04-30 16:57:02 -07:00
Phil Wang	d1a697ac23	allows one to shortcut sampling at a specific unet number, if one were to be training in stages	2022-04-30 16:05:13 -07:00
Phil Wang	ebe01749ed	DecoderTrainer sample method uses the exponentially moving averaged	2022-04-30 14:55:34 -07:00
Phil Wang	63195cc2cb	allow for division of loss prior to scaling, for gradient accumulation purposes	2022-04-30 12:56:47 -07:00
Phil Wang	a2ef69af66	take care of mixed precision, and make gradient accumulation do-able externally	2022-04-30 12:27:24 -07:00
Phil Wang	5fff22834e	be able to finely customize learning parameters for each unet, take care of gradient clipping	2022-04-30 11:56:05 -07:00
Phil Wang	a9421f49ec	simplify Decoder training for the public	2022-04-30 11:45:18 -07:00
Phil Wang	77fa34eae9	fix all clipping / clamping issues	2022-04-30 10:08:24 -07:00
Phil Wang	1c1e508369	fix all issues with text encodings conditioning in the decoder, using null padding tokens technique from dalle v1	2022-04-30 09:13:34 -07:00
Phil Wang	f19c99ecb0	fix decoder needing separate conditional dropping probabilities for image embeddings and text encodings, thanks to @xiankgx !	2022-04-30 08:48:05 -07:00
Phil Wang	721a444686	Merge pull request #37 from ProGamerGov/patch-1 Fix spelling and grammatical errors	2022-04-30 08:19:07 -07:00
ProGamerGov	63450b466d	Fix spelling and grammatical errors	2022-04-30 09:18:13 -06:00
Phil Wang	20e7eb5a9b	cleanup	2022-04-30 07:22:57 -07:00
Phil Wang	e2f9615afa	use @clip-anytorch , thanks to @rom1504	2022-04-30 06:40:54 -07:00
Phil Wang	0d1c07c803	fix a bug with classifier free guidance, thanks to @xiankgx again!	2022-04-30 06:34:57 -07:00
Phil Wang	a389f81138	todo	2022-04-29 15:40:51 -07:00
Phil Wang	0283556608	fix example in readme, since api changed	2022-04-29 13:40:55 -07:00
Phil Wang	5063d192b6	now completely OpenAI CLIP compatible for training just take care of the logic for AdamW and transformers used namedtuples for clip adapter embedding outputs	2022-04-29 13:05:01 -07:00
Phil Wang	f4a54e475e	add some training fns	2022-04-29 09:44:55 -07:00
Phil Wang	fb662a62f3	fix another bug thanks to @xiankgx	2022-04-29 07:38:32 -07:00
Phil Wang	587c8c9b44	optimize for clarity	2022-04-28 21:59:13 -07:00
Phil Wang	aa900213e7	force first unet in the cascade to be conditioned on image embeds	2022-04-28 20:53:15 -07:00
Phil Wang	cb26187450	vqgan-vae codebook dims should be 256 or smaller	2022-04-28 08:59:03 -07:00
Phil Wang	625ce23f6b	🐛	2022-04-28 07:21:18 -07:00
Phil Wang	dbf4a281f1	make sure another CLIP can actually be passed in, as long as it is wrapped in an adapter extended from BaseClipAdapter	2022-04-27 20:45:27 -07:00
Phil Wang	4ab527e779	some extra asserts for text encoding of diffusion prior and decoder	2022-04-27 20:11:43 -07:00
Phil Wang	d0cdeb3247	add ability for DALL-E2 to return PIL images with `return_pil_images = True` on forward, for those who have no clue about deep learning	2022-04-27 19:58:06 -07:00
Phil Wang	8c610aad9a	only pass text encodings conditioning in diffusion prior if specified on initialization	2022-04-27 19:48:16 -07:00
Phil Wang	6700381a37	prepare for ability to integrate other clips other than x-clip	2022-04-27 19:35:05 -07:00
Phil Wang	20377f889a	todo	2022-04-27 17:22:14 -07:00
Phil Wang	6edb1c5dd0	fix issue with ema class	2022-04-27 16:40:02 -07:00
Phil Wang	b093f92182	inform what is possible	2022-04-27 08:25:16 -07:00
Phil Wang	fa3bb6ba5c	make sure cpu-only still works	2022-04-27 08:02:10 -07:00
Phil Wang	2705e7c9b0	attention-based upsampling claims unsupported by local experiments, removing	2022-04-27 07:51:04 -07:00
Phil Wang	77141882c8	complete vit-vqgan from https://arxiv.org/abs/2110.04627	2022-04-26 17:20:47 -07:00
Phil Wang	4075d02139	nevermind, it could be working, but only when i stabilize it with the feedforward layer + tanh as proposed in vit-vqgan paper (which will be built into the repository later for the latent diffusion)	2022-04-26 12:43:31 -07:00
Phil Wang	de0296106b	be able to turn off warning for use of LazyLinear by passing in text embedding dimension for unet	2022-04-26 11:42:46 -07:00
Phil Wang	eafb136214	suppress a warning	2022-04-26 11:40:45 -07:00
Phil Wang	bfbcc283a3	DRY a tiny bit for gaussian diffusion related logic	2022-04-26 11:39:12 -07:00
Phil Wang	c30544b73a	no CLIP altogether for training DiffusionPrior	2022-04-26 10:23:41 -07:00
Phil Wang	bdf5e9c009	todo	2022-04-26 09:56:54 -07:00
Phil Wang	9878be760b	have researcher explicitly state upfront whether to condition with text encodings in cascading ddpm decoder, have DALLE-2 class take care of passing in text if feature turned on	2022-04-26 09:47:09 -07:00
Phil Wang	7ba6357c05	allow for training the Prior network with precomputed CLIP embeddings (or text encodings)	2022-04-26 09:29:51 -07:00
Phil Wang	76e063e8b7	refactor so that the causal transformer in the diffusion prior network can be conditioned without text encodings (for Laions parallel efforts, although it seems from the paper it is needed)	2022-04-26 09:00:11 -07:00
Phil Wang	4d25976f33	make sure non-latent diffusion still works	2022-04-26 08:36:00 -07:00
Phil Wang	0b28ee0d01	revert back to old upsampling, paper does not work	2022-04-26 07:39:04 -07:00
Phil Wang	45262a4bb7	bring in the exponential moving average wrapper, to get ready for training	2022-04-25 19:24:13 -07:00
Phil Wang	13a58a78c4	scratch off todo	2022-04-25 19:01:30 -07:00
Phil Wang	f75d49c781	start a file for all attention-related modules, use attention-based upsampling in the unets in dalle-2	2022-04-25 18:59:10 -07:00
Phil Wang	3b520dfa85	bring in attention-based upsampling to strengthen vqgan-vae, seems to work as advertised in initial experiments in GAN	2022-04-25 17:27:45 -07:00
Phil Wang	79198c6ae4	keep readme simple for reader	2022-04-25 17:21:45 -07:00
Phil Wang	77a246b1b9	todo	2022-04-25 08:48:28 -07:00
Phil Wang	f93a3f6ed8	reprioritize	2022-04-25 08:44:27 -07:00
Phil Wang	8f2a0c7e00	better naming	2022-04-25 07:44:33 -07:00
Phil Wang	863f4ef243	just take care of the logic for setting all latent diffusion to predict x0, if needed	2022-04-24 10:06:42 -07:00
Phil Wang	fb8a66a2de	just in case latent diffusion performs better with prediction of x0 instead of epsilon, open up the research avenue	2022-04-24 10:04:22 -07:00
Phil Wang	579d4b42dd	does not seem right to clip for the prior diffusion part	2022-04-24 09:51:18 -07:00
Phil Wang	473808850a	some outlines to the eventual CLI endpoint	2022-04-24 09:27:15 -07:00
Phil Wang	d5318aef4f	todo	2022-04-23 08:23:08 -07:00
Phil Wang	f82917e1fd	prepare for turning off gradient penalty, as shown in GAN literature, GP needs to be only applied 1 out of 4 iterations	2022-04-23 07:52:10 -07:00
Phil Wang	05b74be69a	use null container pattern to cleanup some conditionals, save more cleanup for next week	2022-04-22 15:23:18 -07:00
Phil Wang	a8b5d5d753	last tweak of readme	2022-04-22 14:16:43 -07:00
Phil Wang	976ef7f87c	project management	2022-04-22 14:15:42 -07:00
Phil Wang	fd175bcc0e	readme	2022-04-22 14:13:33 -07:00
Phil Wang	76b32f18b3	first pass at complete DALL-E2 + Latent Diffusion integration, latent diffusion on any layer(s) of the cascading ddpm in the decoder.	2022-04-22 13:53:13 -07:00
Phil Wang	f2d5b87677	todo	2022-04-22 11:39:58 -07:00
Phil Wang	461347c171	fix vqgan-vae for latent diffusion	2022-04-22 11:38:57 -07:00
Phil Wang	46cef31c86	optional projection out for prior network causal transformer	2022-04-22 11:16:30 -07:00
Phil Wang	59b1a77d4d	be a bit more conservative and stick with layernorm (without bias) for now, given @borisdayma results https://twitter.com/borisdayma/status/1517227191477571585	2022-04-22 11:14:54 -07:00
Phil Wang	7f338319fd	makes more sense for blur augmentation to happen before the upsampling	2022-04-22 11:10:47 -07:00
Phil Wang	2c6c91829d	refactor blurring training augmentation to be taken care of by the decoder, with option to downsample to previous resolution before upsampling (cascading ddpm). this opens up the possibility of cascading latent ddpm	2022-04-22 11:09:17 -07:00
Phil Wang	ad17c69ab6	prepare for latent diffusion in the first DDPM of the cascade in the Decoder	2022-04-21 17:54:31 -07:00
Phil Wang	0b4ec34efb	todo	2022-04-20 12:24:23 -07:00
Phil Wang	f027b82e38	remove wip as main networks (prior and decoder) are completed	2022-04-20 12:12:16 -07:00
Phil Wang	8cc9016cb0	Merge pull request #17 from kashif/patch-2: added diffusion-gan thoughts	2022-04-20 12:10:26 -07:00
Kashif Rasul	1d8f37befe	added diffusion-gan thoughts https://github.com/NVlabs/denoising-diffusion-gan	2022-04-20 21:01:11 +02:00
Phil Wang	faebf4c8b8	from my vision transformer experience, dimension of attention head of 32 is sufficient for image feature maps	2022-04-20 11:40:32 -07:00
Phil Wang	b8e8d3c164	thoughts	2022-04-20 11:34:51 -07:00
Phil Wang	8e2416b49b	commit to generalizing latent diffusion to one model	2022-04-20 11:27:42 -07:00
Phil Wang	f37c26e856	cleanup and DRY a little	2022-04-20 10:56:32 -07:00
Phil Wang	27a33e1b20	complete contextmanager method for keeping only one unet in GPU during training or inference	2022-04-20 10:46:13 -07:00
Phil Wang	6f941a219a	give time tokens a surface area of 2 tokens as default, make it so researcher can customize which unet actually is conditioned on image embeddings and/or text encodings	2022-04-20 10:04:47 -07:00
Phil Wang	ddde8ca1bf	fix cosine beta schedule, thanks to @Zhengxinyang	2022-04-19 20:54:28 -07:00
Phil Wang	c26b77ad20	todo	2022-04-19 13:07:32 -07:00
Phil Wang	c5b4aab8e5	intent	2022-04-19 11:00:05 -07:00
Phil Wang	a35c309b5f	add sparse attention layers in between convnext blocks in unet (grid like attention, used in mobilevit, maxvit [bytedance ai], as well as a growing number of attention-based GANs)	2022-04-19 09:49:03 -07:00
Phil Wang	55bdcb98b9	scaffold for latent diffusion	2022-04-19 09:26:58 -07:00
Phil Wang	82328f16cd	same for text encodings for decoder ddpm training	2022-04-18 14:41:02 -07:00
Phil Wang	6fee4fce6e	also allow for image embedding to be passed into the diffusion model, in the case one wants to generate image embedding once and then train multiple unets in one iteration	2022-04-18 14:00:38 -07:00
Phil Wang	a54e309269	prioritize todos, play project management	2022-04-18 13:28:01 -07:00
Phil Wang	c6bfd7fdc8	readme	2022-04-18 12:43:10 -07:00
Phil Wang	960a79857b	use some magic just this once to remove the need for researchers to think	2022-04-18 12:40:43 -07:00
Phil Wang	7214df472d	todo	2022-04-18 12:18:19 -07:00
Phil Wang	00ae50999b	make kernel size and sigma for gaussian blur for cascading DDPM overridable at forward. also make sure unets are wrapped in a modulelist so that at sample time, blurring does not happen	2022-04-18 12:04:31 -07:00
Phil Wang	6cddefad26	readme	2022-04-18 11:52:25 -07:00
Phil Wang	0332eaa6ff	complete first pass at full cascading DDPM setup in Decoder, flexible enough to support one unet for testing	2022-04-18 11:44:56 -07:00
Phil Wang	1cce4225eb	0.0.18	2022-04-17 07:29:34 -07:00
Phil Wang	5ab0700bab	Merge pull request #14 from kashif/loss-schedule: added huber loss and other schedulers	2022-04-17 07:29:10 -07:00
Kashif Rasul	b0f2fbaa95	schedule to Prior	2022-04-17 15:21:47 +02:00
Kashif Rasul	51361c2d15	added beta_schedule argument	2022-04-17 15:19:33 +02:00
Kashif Rasul	42d6e47387	added huber loss and other schedulers	2022-04-17 15:14:05 +02:00
Phil Wang	1e939153fb	link to AssemblyAI explanation	2022-04-15 12:58:57 -07:00
Phil Wang	1abeb8918e	personal project management for next week	2022-04-15 08:04:01 -07:00
Phil Wang	b423855483	commit to jax version	2022-04-15 07:16:25 -07:00
Phil Wang	c400d8758c	prepare for cascading diffusion in unet, save the full progressive upsampling architecture to be built next week	2022-04-15 07:03:28 -07:00
Phil Wang	bece206699	fix bug thanks to @jihoonerd	2022-04-15 06:44:40 -07:00
Phil Wang	5b4ee09625	ideation	2022-04-14 13:48:01 -07:00
Phil Wang	6e27f617f1	use t5 relative positional bias in prior network causal transformer, since it makes more sense than rotary embeddings	2022-04-14 12:01:09 -07:00
Phil Wang	9f55c24db6	allow for decoder conditioning with the text encodings from CLIP, if it is passed in. use lazy linear to avoid researchers having to worry about text encoding dimensions, but remove later if it does not work well	2022-04-14 11:46:45 -07:00
Phil Wang	69e822b7f8	"project management"	2022-04-14 10:20:37 -07:00
Phil Wang	23c401a5d5	use the eval decorator	2022-04-14 10:13:43 -07:00
Phil Wang	68e9883f59	use cross attention for conditioning unet based on image embedding tokens (which opens up the door to conditioning on text encodings as well)	2022-04-14 10:10:04 -07:00
Phil Wang	95b018374a	start using swish glu everywhere, given success of PaLM	2022-04-14 09:34:32 -07:00
Phil Wang	8b5c2385b0	better naming	2022-04-14 09:24:31 -07:00
Phil Wang	f2c52d8239	fix bug with classifier free guidance for prior network, even though it seems it may not be used	2022-04-14 09:21:51 -07:00
Phil Wang	97e951221b	bring in blur, as it will be used somewhere in the cascading DDPM in the decoder eventually, once i figure it out	2022-04-14 09:16:09 -07:00
Phil Wang	e1b0c140f1	cleanup readme	2022-04-14 08:51:22 -07:00
Phil Wang	5989569a44	link to OpenCLIP effort	2022-04-14 08:31:15 -07:00
Phil Wang	82464d7bd3	per-fect	2022-04-14 08:30:07 -07:00
Phil Wang	7fb3f695d5	offer continuously parameterized time embedding for diffusion prior network, remove a hyperparameter that may trip up people, if not set correctly	2022-04-14 08:28:11 -07:00