add diffusion prior trainer, which automatically takes care of the exponential moving average (training and sampling), as well as mixed precision, gradient clipping

fix training with clip
remove l2norm output from train_diffusion_prior.py
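
For reviewers unfamiliar with the `ema_*` options forwarded to `EMA` (`ema_beta`, `ema_update_after_step`, `ema_update_every` in the README example), here is a minimal pure-Python sketch of how such a schedule typically behaves. The `EMA` class body is not part of this diff, so the warm-up and "every N steps" semantics below are assumptions, and `ScalarEMA` is a hypothetical stand-in that tracks a single float rather than model parameters.

```python
class ScalarEMA:
    """Toy exponential moving average over one float (hypothetical, not the repo's EMA class)."""

    def __init__(self, beta=0.99, update_after_step=1000, update_every=10):
        self.beta = beta
        self.update_after_step = update_after_step
        self.update_every = update_every
        self.step = 0
        self.value = None

    def update(self, online_value):
        self.step += 1
        # Assumption: only every `update_every`-th call touches the average.
        if self.step % self.update_every != 0:
            return
        # Assumption: during warm-up the average simply copies the online value.
        if self.step <= self.update_after_step or self.value is None:
            self.value = online_value
            return
        # Standard EMA rule: new = beta * old + (1 - beta) * current
        self.value = self.beta * self.value + (1 - self.beta) * online_value


ema = ScalarEMA(beta=0.5, update_after_step=2, update_every=1)
for online in (1.0, 3.0, 5.0):
    ema.update(online)
# After two warm-up copies and one averaged step: 0.5 * 3.0 + 0.5 * 5.0 = 4.0
```

With a larger `beta` such as 0.99, the averaged value follows the online value more slowly, which is the usual reason to sample from the averaged model.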
2026-02-12 19:44:26 +01:00 · 2022-05-06 08:06:28 -07:00 · 2022-05-06 07:37:57 -07:00 · 2022-05-05 19:07:58 -07:00 · 2022-05-05 08:11:01 -07:00 · 2022-05-05 07:54:12 -07:00
6 changed files with 265 additions and 65 deletions
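
In `DiffusionPriorTrainer.update()`, gradients are unscaled and then passed to `nn.utils.clip_grad_norm_` whenever `max_grad_norm` is set. As a rough illustration of what clipping by global norm does, here is a pure-Python sketch on a flat list of gradient values. The helper name `clip_by_global_norm` and the `eps` term are assumptions made for this example; the real call operates on parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0 when the norm is already small enough.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.5)
norm_after = math.sqrt(sum(g * g for g in clipped))
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited.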
--- a/README.md
+++ b/README.md
@@ -587,47 +587,6 @@ images = dalle2(

 Now you'll just have to worry about training the Prior and the Decoder!

-## Dataloaders
-In order to make loading data simple and efficient, we include some general dataloaders that can be used to train portions of the network.
-
-### Decoder: Image Embedding Dataset
-When training the decoder (and up samplers if training together) in isolation, you will need to load images and corresponding image embeddings. This dataset can read two similar types of datasets. First, it can read a [webdataset](https://github.com/webdataset/webdataset) that contains `.jpg` and `.npy` files in the `.tar`s that contain the images and associated image embeddings respectively. Alternatively, you can also specify a source for the embeddings outside of the webdataset. In this case, the path to the embeddings should contain `.npy` files with the same shard numbers as the webdataset and there should be a correspondence between the filename of the `.jpg` and the index of the embedding in the `.npy`. So, for example, `0001.tar` from the webdataset with image `00010509.jpg` (the first 4 digits are the shard number and the last 4 are the index) in it should be paralleled by a `img_emb_0001.npy` which contains a NumPy array with the embedding at index 509.
-
-Generating a dataset of this type: 
-1. Use [img2dataset](https://github.com/rom1504/img2dataset) to generate a webdataset.
-2. Use [clip-retrieval](https://github.com/rom1504/clip-retrieval) to convert the images to embeddings.
-3. Use [embedding-dataset-reordering](https://github.com/Veldrovive/embedding-dataset-reordering) to reorder the embeddings into the expected format.
-
-Usage:
-```python
-from dalle2_pytorch.dataloaders import ImageEmbeddingDataset, create_image_embedding_dataloader
-
-# Create a dataloader directly.
-dataloader = create_image_embedding_dataloader(
-    tar_url="/path/or/url/to/webdataset/{0000..9999}.tar", # Uses bracket expanding notation. This specifies to read all tars from 0000.tar to 9999.tar
-    embeddings_url="path/or/url/to/embeddings/folder",     # Included if .npy files are not in webdataset. Left out or set to None otherwise
-    num_workers=4,
-    batch_size=32,
-    shard_width=4,                                         # If a file in the webdataset shard 3 is named 0003039.jpg, we know the shard width is 4 and the last three digits are the index
-    shuffle_num=200,                                       # Does a shuffle of the data with a buffer size of 200
-    shuffle_shards=True,                                   # Shuffle the order the shards are read in
-    resample_shards=False,                                 # Sample shards with replacement. If true, an epoch will be infinite unless stopped manually
-)
-for img, emb in dataloader:
-    print(img.shape)  # torch.Size([32, 3, 256, 256])
-    print(emb.shape)  # torch.Size([32, 512])
-    # Train decoder only as shown above
-
-# Or create a dataset without a loader so you can configure it manually
-dataset = ImageEmbeddingDataset(
-    urls="/path/or/url/to/webdataset/{0000..9999}.tar",
-    embedding_folder_url="path/or/url/to/embeddings/folder",
-    shard_width=4,
-    shuffle_shards=True,
-    resample=False
-)
-```
-
 ## Experimental

 ### DALL-E2 with Latent Diffusion
@@ -827,6 +786,149 @@ mock_image_embed = torch.randn(4, 512).cuda()
 images = decoder_trainer.sample(mock_image_embed, text = text) # (4, 3, 256, 256)
 ```

+### Diffusion Prior Training
+
+Similarly, one can use the `DiffusionPriorTrainer` to automatically instantiate and keep track of an exponential moving averaged prior.
+
+```python
+import torch
+from dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, DiffusionPriorTrainer, Unet, Decoder, CLIP
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 6,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 6,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8
+).cuda()
+
+# mock data
+
+text = torch.randint(0, 49408, (4, 256)).cuda()
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# prior networks (with transformer)
+
+prior_network = DiffusionPriorNetwork(
+    dim = 512,
+    depth = 6,
+    dim_head = 64,
+    heads = 8
+).cuda()
+
+diffusion_prior = DiffusionPrior(
+    net = prior_network,
+    clip = clip,
+    timesteps = 100,
+    cond_drop_prob = 0.2
+).cuda()
+
+diffusion_prior_trainer = DiffusionPriorTrainer(
+    diffusion_prior,
+    lr = 3e-4,
+    wd = 1e-2,
+    ema_beta = 0.99,
+    ema_update_after_step = 1000,
+    ema_update_every = 10,
+)
+
+loss = diffusion_prior_trainer(text, images)
+loss.backward()
+diffusion_prior_trainer.update()  # this will update the optimizer as well as the exponential moving averaged diffusion prior
+
+# after running the above three lines in a loop for many steps
+# you can sample from the exponential moving average of the diffusion prior identically to how you do so for DiffusionPrior
+
+image_embeds = diffusion_prior_trainer.sample(text) # (4, 512) - exponential moving averaged image embeddings
+```
+
+### Decoder Dataloaders
+
+In order to make loading data simple and efficient, we include some general dataloaders that can be used to train portions of the network.
+
+#### Decoder: Image Embedding Dataset
+
+When training the decoder (and up samplers if training together) in isolation, you will need to load images and corresponding image embeddings. This dataset can read two similar types of datasets. First, it can read a [webdataset](https://github.com/webdataset/webdataset) that contains `.jpg` and `.npy` files in the `.tar`s that contain the images and associated image embeddings respectively. Alternatively, you can also specify a source for the embeddings outside of the webdataset. In this case, the path to the embeddings should contain `.npy` files with the same shard numbers as the webdataset and there should be a correspondence between the filename of the `.jpg` and the index of the embedding in the `.npy`. So, for example, `0001.tar` from the webdataset with image `00010509.jpg` (the first 4 digits are the shard number and the last 4 are the index) in it should be paralleled by a `img_emb_0001.npy` which contains a NumPy array with the embedding at index 509.
+
+Generating a dataset of this type: 
+1. Use [img2dataset](https://github.com/rom1504/img2dataset) to generate a webdataset.
+2. Use [clip-retrieval](https://github.com/rom1504/clip-retrieval) to convert the images to embeddings.
+3. Use [embedding-dataset-reordering](https://github.com/Veldrovive/embedding-dataset-reordering) to reorder the embeddings into the expected format.
+
+Usage:
+
+```python
+from dalle2_pytorch.dataloaders import ImageEmbeddingDataset, create_image_embedding_dataloader
+
+# Create a dataloader directly.
+dataloader = create_image_embedding_dataloader(
+    tar_url="/path/or/url/to/webdataset/{0000..9999}.tar", # Uses bracket expanding notation. This specifies to read all tars from 0000.tar to 9999.tar
+    embeddings_url="path/or/url/to/embeddings/folder",     # Included if .npy files are not in webdataset. Left out or set to None otherwise
+    num_workers=4,
+    batch_size=32,
+    shard_width=4,                                         # If a file in the webdataset shard 3 is named 0003039.jpg, we know the shard width is 4 and the last three digits are the index
+    shuffle_num=200,                                       # Does a shuffle of the data with a buffer size of 200
+    shuffle_shards=True,                                   # Shuffle the order the shards are read in
+    resample_shards=False,                                 # Sample shards with replacement. If true, an epoch will be infinite unless stopped manually
+)
+for img, emb in dataloader:
+    print(img.shape)  # torch.Size([32, 3, 256, 256])
+    print(emb.shape)  # torch.Size([32, 512])
+    # Train decoder only as shown above
+
+# Or create a dataset without a loader so you can configure it manually
+dataset = ImageEmbeddingDataset(
+    urls="/path/or/url/to/webdataset/{0000..9999}.tar",
+    embedding_folder_url="path/or/url/to/embeddings/folder",
+    shard_width=4,
+    shuffle_shards=True,
+    resample=False
+)
+```
+
+## Scripts
+
+### Using the `train_diffusion_prior.py` script
+
+This script trains the DiffusionPrior on pre-computed text and image embeddings. The working example below illustrates the process.
+Note that, unlike the example below, the script internally passes `text_embed` and `image_embed` to the DiffusionPrior.
+
+### Usage 
+
+```bash
+$ python train_diffusion_prior.py
+```
+
+The most significant parameters for the script are as follows:
+
+--image-embed-url, default = "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/"

+--text-embed-url, default = "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/"

+--image-embed-dim, default=768 - 768 corresponds to the ViT-L/14 embedding size; change it to match the embedding size of your chosen ViT
+
+--learning-rate, default=1.1e-4
+
+--weight-decay,  default=6.02e-2
+
+--max-grad-norm, default=0.5
+
+--batch-size, default=10 ** 4
+
+--num-epochs, default=5
+
+--clip, default=None # Signals the prior to use pre-computed embeddings
+
+### Sample wandb run log
+
+A sample wandb run log is available at: https://wandb.ai/laion/diffusion-prior/runs/aul0rhv5?workspace=
+
 ## CLI (wip)

 ```bash
@@ -864,7 +966,7 @@ Once built, images will be saved to the same directory the command is invoked
 - [x] add convnext backbone for vqgan-vae (in addition to vit [vit-vqgan] + resnet)
 - [x] make sure DDPMs can be run with traditional resnet blocks (but leave convnext as an option for experimentation)
 - [x] make sure for the latter unets in the cascade, one can train on crops for learning super resolution (constrain the unet to be only convolutions in that case, or allow conv-like attention with rel pos bias)
- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo)
+- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo) - consider https://github.com/lucidrains/uformer-pytorch attention-based unet
 - [ ] copy the cascading ddpm code to a separate repo (perhaps https://github.com/lucidrains/denoising-diffusion-pytorch) as the main contribution of dalle2 really is just the prior network
 - [ ] transcribe code to Jax, which lowers the activation energy for distributed training, given access to TPUs
 - [ ] pull logic for training diffusion prior into a class DiffusionPriorTrainer, for eventual script based + CLI based training
@@ -877,7 +979,7 @@ Once built, images will be saved to the same directory the command is invoked
 - [ ] use an experimental tracker agnostic setup, as done <a href="https://github.com/lucidrains/tf-bind-transformer#simple-trainer-class-for-fine-tuning">here</a>
 - [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2
 - [ ] make sure FILIP works with DALL-E2 from x-clip https://arxiv.org/abs/2111.07783
- [ ] make sure resnet | convnext block hyperparameters can be configurable across unet depth (groups and expansion factor)
+- [ ] make sure resnet hyperparameters can be configurable across unet depth (groups and expansion factor)

 ## Citations

--- a/dalle2_pytorch/__init__.py
+++ b/dalle2_pytorch/__init__.py
@@ -1,6 +1,6 @@
 from dalle2_pytorch.dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder
 from dalle2_pytorch.dalle2_pytorch import OpenAIClipAdapter
-from dalle2_pytorch.train import DecoderTrainer
+from dalle2_pytorch.train import DecoderTrainer, DiffusionPriorTrainer

 from dalle2_pytorch.vqgan_vae import VQGanVAE
 from x_clip import CLIP
--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
@@ -652,14 +652,12 @@ class DiffusionPriorNetwork(nn.Module):
        self,
        dim,
        num_timesteps = None,
-        l2norm_output = False,  # whether to restrict image embedding output with l2norm at the end (may make it easier to learn?)
        **kwargs
    ):
        super().__init__()
        self.time_embeddings = nn.Embedding(num_timesteps, dim) if exists(num_timesteps) else nn.Sequential(Rearrange('b -> b 1'), MLP(1, dim)) # also offer a continuous version of timestep embeddings, with a 2 layer MLP
        self.learned_query = nn.Parameter(torch.randn(dim))
        self.causal_transformer = CausalTransformer(dim = dim, **kwargs)
-        self.l2norm_output = l2norm_output

    def forward_with_cond_scale(
        self,
@@ -738,8 +736,7 @@ class DiffusionPriorNetwork(nn.Module):

        pred_image_embed = tokens[..., -1, :]

-        output_fn = l2norm if self.l2norm_output else identity
-        return output_fn(pred_image_embed)
+        return pred_image_embed

 class DiffusionPrior(BaseGaussianDiffusion):
    def __init__(
@@ -787,7 +784,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
        self.predict_x_start = predict_x_start

        # @crowsonkb 's suggestion - https://github.com/lucidrains/DALLE2-pytorch/issues/60#issue-1226116132
-        self.image_embed_scale = default(image_embed_scale, image_embed_dim ** 0.5)
+        self.image_embed_scale = default(image_embed_scale, self.image_embed_dim ** 0.5)

        # whether to force an l2norm, similar to clipping denoised, when sampling
        self.sampling_clamp_l2norm = sampling_clamp_l2norm
@@ -848,6 +845,18 @@ class DiffusionPrior(BaseGaussianDiffusion):
        loss = self.loss_fn(pred, target)
        return loss

+    @torch.inference_mode()
+    @eval_decorator
+    def sample_batch_size(self, batch_size, text_cond):
+        device = self.betas.device
+        shape = (batch_size, self.image_embed_dim)
+
+        img = torch.randn(shape, device = device)
+
+        for i in tqdm(reversed(range(0, self.num_timesteps)), desc = 'sampling loop time step', total = self.num_timesteps):
+            img = self.p_sample(img, torch.full((batch_size,), i, device = device, dtype = torch.long), text_cond = text_cond)
+        return img
+
    @torch.inference_mode()
    @eval_decorator
    def sample(self, text, num_samples_per_batch = 2):
@@ -1460,7 +1469,9 @@ class Decoder(BaseGaussianDiffusion):
        self,
        unet,
        *,
-        clip,
+        clip = None,
+        image_size = None,
+        channels = 3,
        vae = tuple(),
        timesteps = 1000,
        image_cond_drop_prob = 0.1,
@@ -1484,15 +1495,22 @@ class Decoder(BaseGaussianDiffusion):
            loss_type = loss_type
        )

-        if isinstance(clip, CLIP):
-            clip = XClipAdapter(clip)
+        assert exists(clip) ^ exists(image_size), 'either CLIP is supplied, or you must give the image_size and channels (usually 3 for RGB)'

-        freeze_model_and_make_eval_(clip)
-        assert isinstance(clip, BaseClipAdapter)
+        self.clip = None
+        if exists(clip):
+            if isinstance(clip, CLIP):
+                clip = XClipAdapter(clip)

-        self.clip = clip
-        self.clip_image_size = clip.image_size
-        self.channels = clip.image_channels
+            freeze_model_and_make_eval_(clip)
+            assert isinstance(clip, BaseClipAdapter)
+
+            self.clip = clip
+            self.clip_image_size = clip.image_size
+            self.channels = clip.image_channels
+        else:
+            self.clip_image_size = image_size
+            self.channels = channels

        self.condition_on_text_encodings = condition_on_text_encodings

@@ -1525,7 +1543,7 @@ class Decoder(BaseGaussianDiffusion):

        # unet image sizes

-        image_sizes = default(image_sizes, (clip.image_size,))
+        image_sizes = default(image_sizes, (self.clip_image_size,))
        image_sizes = tuple(sorted(set(image_sizes)))

        assert len(self.unets) == len(image_sizes), f'you did not supply the correct number of u-nets ({len(self.unets)}) for resolutions {image_sizes}'
@@ -1730,10 +1748,12 @@ class Decoder(BaseGaussianDiffusion):
        times = torch.randint(0, self.num_timesteps, (b,), device = device, dtype = torch.long)

        if not exists(image_embed):
+            assert exists(self.clip), 'if you want to derive CLIP image embeddings automatically, you must supply `clip` to the decoder on init'
            image_embed, _ = self.clip.embed_image(image)

        text_encodings = text_mask = None
        if exists(text) and not exists(text_encodings):
+            assert exists(self.clip), 'if you are passing in raw text, you need to supply `clip` to the decoder'
            _, text_encodings, text_mask = self.clip.embed_text(text)

        assert not (self.condition_on_text_encodings and not exists(text_encodings)), 'text or text encodings must be passed into decoder if specified'
--- a/dalle2_pytorch/train.py
+++ b/dalle2_pytorch/train.py
@@ -5,7 +5,7 @@ import torch
 from torch import nn
 from torch.cuda.amp import autocast, GradScaler

-from dalle2_pytorch.dalle2_pytorch import Decoder
+from dalle2_pytorch.dalle2_pytorch import Decoder, DiffusionPrior
 from dalle2_pytorch.optimizer import get_optimizer

 # helper functions
@@ -89,7 +89,88 @@ class EMA(nn.Module):
    def __call__(self, *args, **kwargs):
        return self.ema_model(*args, **kwargs)

-# trainers
+# diffusion prior trainer
+
+class DiffusionPriorTrainer(nn.Module):
+    def __init__(
+        self,
+        diffusion_prior,
+        use_ema = True,
+        lr = 3e-4,
+        wd = 1e-2,
+        max_grad_norm = None,
+        amp = False,
+        **kwargs
+    ):
+        super().__init__()
+        assert isinstance(diffusion_prior, DiffusionPrior)
+        ema_kwargs, kwargs = groupby_prefix_and_trim('ema_', kwargs)
+
+        self.diffusion_prior = diffusion_prior
+
+        # exponential moving average
+
+        self.use_ema = use_ema
+
+        if use_ema:
+            has_lazy_linear = any([type(module) == nn.LazyLinear for module in diffusion_prior.modules()])
+            assert not has_lazy_linear, 'you must set the text_embed_dim on your u-nets if you plan on doing automatic exponential moving average'
+
+        if self.use_ema:
+            self.ema_diffusion_prior = EMA(diffusion_prior, **ema_kwargs)
+
+        # optimizer and mixed precision stuff
+
+        self.amp = amp
+
+        self.scaler = GradScaler(enabled = amp)
+
+        self.optimizer = get_optimizer(
+            diffusion_prior.parameters(),
+            lr = lr,
+            wd = wd,
+            **kwargs
+        )
+
+        # gradient clipping if needed
+
+        self.max_grad_norm = max_grad_norm
+
+    def update(self):
+        if exists(self.max_grad_norm):
+            self.scaler.unscale_(self.optimizer)
+            nn.utils.clip_grad_norm_(self.diffusion_prior.parameters(), self.max_grad_norm)
+
+        self.scaler.step(self.optimizer)
+        self.scaler.update()
+        self.optimizer.zero_grad()
+
+        if self.use_ema:
+            self.ema_diffusion_prior.update()
+
+    @torch.inference_mode()
+    def p_sample_loop(self, *args, **kwargs):
+        return self.ema_diffusion_prior.ema_model.p_sample_loop(*args, **kwargs)
+
+    @torch.inference_mode()
+    def sample(self, *args, **kwargs):
+        return self.ema_diffusion_prior.ema_model.sample(*args, **kwargs)
+
+    @torch.inference_mode()
+    def sample_batch_size(self, *args, **kwargs):
+        return self.ema_diffusion_prior.ema_model.sample_batch_size(*args, **kwargs)
+
+    def forward(
+        self,
+        *args,
+        divisor = 1,
+        **kwargs
+    ):
+        with autocast(enabled = self.amp):
+            loss = self.diffusion_prior(*args, **kwargs)
+        return self.scaler.scale(loss / divisor)
+
+# decoder trainer

 class DecoderTrainer(nn.Module):
    def __init__(
--- a/setup.py
+++ b/setup.py
@@ -10,7 +10,7 @@ setup(
      'dream = dalle2_pytorch.cli:dream'
    ],
  },
-  version = '0.0.104',
+  version = '0.0.108',
  license='MIT',
  description = 'DALL-E 2',
  author = 'Phil Wang',
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -85,7 +85,6 @@ def train(image_embed_dim,
          clip,
          dp_condition_on_text_encodings,
          dp_timesteps,
-          dp_l2norm_output,
          dp_normformer,
          dp_cond_drop_prob,
          dpn_depth,
@@ -105,8 +104,7 @@ def train(image_embed_dim,
            depth = dpn_depth, 
            dim_head = dpn_dim_head, 
            heads = dpn_heads,
-            normformer = dp_normformer,
-            l2norm_output = dp_l2norm_output).to(device)
+            normformer = dp_normformer).to(device)
    
    # DiffusionPrior with text embeddings and image embeddings pre-computed
    diffusion_prior = DiffusionPrior( 
@@ -273,7 +271,6 @@ def main():
          args.clip,
          args.dp_condition_on_text_encodings,
          args.dp_timesteps,
-          args.dp_l2norm_output,
          args.dp_normformer,
          args.dp_cond_drop_prob,
          args.dpn_depth,
Author	SHA1	Message	Date
Phil Wang	740d644050	add diffusion prior trainer, which automatically takes care of the exponential moving average (training and sampling), as well as mixed precision, gradient clipping	2022-05-06 08:06:28 -07:00
Phil Wang	878b555ef7	fix training with clip	2022-05-06 07:37:57 -07:00
Phil Wang	63029f7388	remove l2norm output from train_diffusion_prior.py	2022-05-05 19:07:58 -07:00
Phil Wang	c76a964fd6	allow for CLIP to be optional in Decoder, and allow DecoderTrainer to work off training pre-encoded image embeddings	2022-05-05 08:11:01 -07:00
Phil Wang	79fabc4341	reorg readme	2022-05-05 07:54:12 -07:00
Kumar R	f7ef4bde38	Added some documentation for the diffusion prior in README.md (#62 ) * Delete README.md * Create README.md * Update README.md * Update README.md	2022-05-05 07:51:31 -07:00
Phil Wang	93ba019069	product management	2022-05-05 07:39:51 -07:00
Phil Wang	8518684ae9	does not make much sense, as researchers may want to try predicting noise with diffusionprior instead of predicting x0	2022-05-05 07:37:00 -07:00
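
A note on the `divisor` argument of `DiffusionPriorTrainer.forward`: the returned loss is `loss / divisor` (after gradient scaling), which suggests it is meant for gradient accumulation, where `forward` and `backward` run on several micro-batches before a single `update()`. The sketch below checks the underlying arithmetic in pure Python with a toy one-parameter regression. The function `micro_batch_grad` and the data are hypothetical, and the equality holds only when the micro-batches are the same size.

```python
def micro_batch_grad(w, xs, ys, divisor=1):
    """Gradient w.r.t. `w` of the mean squared error of y ~ w * x, divided by `divisor`."""
    n = len(xs)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return grad / divisor


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient from the full batch in one pass.
full_grad = micro_batch_grad(w, xs, ys)

# Same data split into two equal micro-batches, each divided by the number of chunks.
chunks = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
accumulated_grad = sum(
    micro_batch_grad(w, cx, cy, divisor=len(chunks)) for cx, cy in chunks
)
```

Here `full_grad` and `accumulated_grad` agree, so summing the divided micro-batch gradients reproduces the full-batch gradient while only one micro-batch needs to be held in memory at a time.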