Compare commits

...

6 Commits

4 changed files with 89 additions and 25 deletions

View File

@@ -706,7 +706,7 @@ mock_image_embed = torch.randn(1, 512).cuda()
images = decoder.sample(mock_image_embed) # (1, 3, 1024, 1024)
```
-## Training wrapper (wip)
## Training wrapper
### Decoder Training
@@ -851,6 +851,57 @@ diffusion_prior_trainer.update() # this will update the optimizer as well as th
image_embeds = diffusion_prior_trainer.sample(text) # (4, 512) - exponential moving averaged image embeddings
```
## Bonus
### Unconditional Training
The repository also contains the means to train an unconditional DDPM model, or even cascading DDPMs. You simply have to set `unconditional = True` in the `Decoder`.
ex.
```python
import torch
from dalle2_pytorch import Unet, Decoder

# unets for the cascading ddpm

unet1 = Unet(
    dim = 128,
    dim_mults = (1, 2, 4, 8)
).cuda()

unet2 = Unet(
    dim = 32,
    dim_mults = (1, 2, 4, 8, 16)
).cuda()

# decoder, which contains the unets

decoder = Decoder(
    unet = (unet1, unet2),
    image_sizes = (256, 512), # first unet up to 256px, then second to 512px
    timesteps = 1000,
    unconditional = True
).cuda()

# mock images (get a lot of these)

images = torch.randn(1, 3, 512, 512).cuda()

# feed images into decoder

for i in (1, 2):
    loss = decoder(images, unet_number = i)
    loss.backward()

# do the above for many many many many steps
# then it will learn to generate images

images = decoder.sample(batch_size = 2) # (2, 3, 512, 512)
```
## Dataloaders
### Decoder Dataloaders

In order to make loading data simple and efficient, we include some general dataloaders that can be used to train portions of the network.
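As a minimal illustration of that idea (a hedged sketch, not the repository's own dataloader API), one can pair precomputed images with CLIP image embeddings in a plain PyTorch `DataLoader` and feed the batches to a decoder. The mock tensors below are hypothetical, and the `decoder` instance is assumed to be built as in the conditional examples above, with a 512-dimensional image embedding.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# hypothetical precomputed data: images paired with CLIP image embeddings
mock_images = torch.randn(64, 3, 256, 256)
mock_image_embeds = torch.randn(64, 512)

dataset = TensorDataset(mock_images, mock_image_embeds)
dataloader = DataLoader(dataset, batch_size = 4, shuffle = True)

# `decoder` is assumed to be a Decoder built as in the examples above,
# with unets whose image_embed_dim is 512, living on the GPU
for images, image_embeds in dataloader:
    loss = decoder(images.cuda(), image_embed = image_embeds.cuda(), unet_number = 1)
    loss.backward()
```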
@@ -1014,6 +1065,7 @@ Once built, images will be saved to the same directory the command is invoked
- [ ] allow for unet to be able to condition non-cross attention style as well
- [ ] for all model classes with hyperparameters that change the network architecture, make it a requirement that they expose a config property, and write a simple function that asserts the object is restored correctly
- [ ] for both diffusion prior and decoder, all exponential moving averaged models need to be saved and restored as well (along with the step number)
- [ ] read the paper, figure it out, and build it https://github.com/lucidrains/DALLE2-pytorch/issues/89
## Citations
@@ -1102,4 +1154,13 @@ Once built, images will be saved to the same directory the command is invoked
}
```
```bibtex
@article{ho2021cascaded,
    title   = {Cascaded Diffusion Models for High Fidelity Image Generation},
    author  = {Ho, Jonathan and Saharia, Chitwan and Chan, William and Fleet, David J and Norouzi, Mohammad and Salimans, Tim},
    journal = {arXiv preprint arXiv:2106.15282},
    year    = {2021}
}
```
*Creating noise from data is easy; creating data from noise is generative modeling.* - <a href="https://arxiv.org/abs/2011.13456">Yang Song's paper</a>

View File

@@ -794,7 +794,7 @@ class DiffusionPriorNetwork(nn.Module):
text_embed,
text_encodings = None,
mask = None,
-cond_drop_prob = 0.2
cond_drop_prob = 0.
):
batch, dim, device, dtype = *image_embed.shape, image_embed.device, image_embed.dtype
@@ -901,6 +901,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
self.channels = default(image_channels, lambda: clip.image_channels)
self.cond_drop_prob = cond_drop_prob
self.can_classifier_guidance = cond_drop_prob > 0.
self.condition_on_text_encodings = condition_on_text_encodings
# in paper, they do not predict the noise, but predict x0 directly for image embedding, claiming empirically better results. I'll just offer both.
@@ -914,8 +915,10 @@ class DiffusionPrior(BaseGaussianDiffusion):
self.training_clamp_l2norm = training_clamp_l2norm
self.init_image_embed_l2norm = init_image_embed_l2norm
-def p_mean_variance(self, x, t, text_cond, clip_denoised: bool):
def p_mean_variance(self, x, t, text_cond, clip_denoised = False, cond_scale = 1.):
-pred = self.net(x, t, **text_cond)
assert not (cond_scale != 1. and not self.can_classifier_guidance), 'the model was not trained with conditional dropout, and thus one cannot use classifier free guidance (cond_scale anything other than 1)'
pred = self.net.forward_with_cond_scale(x, t, cond_scale = cond_scale, **text_cond)
if self.predict_x_start:
x_recon = pred
@@ -934,16 +937,16 @@ class DiffusionPrior(BaseGaussianDiffusion):
return model_mean, posterior_variance, posterior_log_variance
@torch.inference_mode()
-def p_sample(self, x, t, text_cond = None, clip_denoised = True, repeat_noise = False):
def p_sample(self, x, t, text_cond = None, clip_denoised = True, repeat_noise = False, cond_scale = 1.):
b, *_, device = *x.shape, x.device
-model_mean, _, model_log_variance = self.p_mean_variance(x = x, t = t, text_cond = text_cond, clip_denoised = clip_denoised)
model_mean, _, model_log_variance = self.p_mean_variance(x = x, t = t, text_cond = text_cond, clip_denoised = clip_denoised, cond_scale = cond_scale)
noise = noise_like(x.shape, device, repeat_noise)
# no noise when t == 0
nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
@torch.inference_mode()
-def p_sample_loop(self, shape, text_cond):
def p_sample_loop(self, shape, text_cond, cond_scale = 1.):
device = self.betas.device
b = shape[0]
@@ -954,7 +957,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
for i in tqdm(reversed(range(0, self.num_timesteps)), desc='sampling loop time step', total=self.num_timesteps):
times = torch.full((b,), i, device = device, dtype = torch.long)
-image_embed = self.p_sample(image_embed, times, text_cond = text_cond)
image_embed = self.p_sample(image_embed, times, text_cond = text_cond, cond_scale = cond_scale)
return image_embed
@@ -980,19 +983,19 @@ class DiffusionPrior(BaseGaussianDiffusion):
@torch.inference_mode()
@eval_decorator
-def sample_batch_size(self, batch_size, text_cond):
def sample_batch_size(self, batch_size, text_cond, cond_scale = 1.):
device = self.betas.device
shape = (batch_size, self.image_embed_dim)
img = torch.randn(shape, device = device)
for i in tqdm(reversed(range(0, self.num_timesteps)), desc = 'sampling loop time step', total = self.num_timesteps):
-img = self.p_sample(img, torch.full((batch_size,), i, device = device, dtype = torch.long), text_cond = text_cond)
img = self.p_sample(img, torch.full((batch_size,), i, device = device, dtype = torch.long), text_cond = text_cond, cond_scale = cond_scale)
return img
@torch.inference_mode()
@eval_decorator
-def sample(self, text, num_samples_per_batch = 2):
def sample(self, text, num_samples_per_batch = 2, cond_scale = 1.):
# in the paper, what they did was
# sample 2 image embeddings, choose the top 1 similarity, as judged by CLIP
text = repeat(text, 'b ... -> (b r) ...', r = num_samples_per_batch)
@@ -1007,7 +1010,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
if self.condition_on_text_encodings:
text_cond = {**text_cond, 'text_encodings': text_encodings, 'mask': text_mask}
-image_embeds = self.p_sample_loop((batch_size, image_embed_dim), text_cond = text_cond)
image_embeds = self.p_sample_loop((batch_size, image_embed_dim), text_cond = text_cond, cond_scale = cond_scale)
# retrieve original unscaled image embed
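The assert added to `p_mean_variance` spells out the contract these hunks introduce: a `cond_scale` other than `1.` is only allowed when the prior was trained with conditional dropout (`cond_drop_prob > 0.`). A hedged usage sketch, assuming `prior_network` (a `DiffusionPriorNetwork`) and `clip` are built as in the README's diffusion prior example:

```python
import torch
from dalle2_pytorch import DiffusionPrior

# prior_network and clip are assumed to come from the README's diffusion prior example
prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2    # > 0., so can_classifier_guidance is True
).cuda()

text = torch.randint(0, 49408, (4, 256)).cuda()

# cond_scale != 1. would trip the new assert if the prior had cond_drop_prob = 0.
image_embed = prior.sample(text, cond_scale = 2.)  # (4, 512), assuming a 512-dim CLIP image embedding
```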
@@ -1305,7 +1308,7 @@ class Unet(nn.Module):
self,
dim,
*,
-image_embed_dim,
image_embed_dim = None,
text_embed_dim = None,
cond_dim = None,
num_image_tokens = 4,
@@ -1377,7 +1380,7 @@ class Unet(nn.Module):
self.image_to_cond = nn.Sequential(
nn.Linear(image_embed_dim, cond_dim * num_image_tokens),
Rearrange('b (n d) -> b n d', n = num_image_tokens)
-) if image_embed_dim != cond_dim else nn.Identity()
) if cond_on_image_embeds and image_embed_dim != cond_dim else nn.Identity()
self.norm_cond = nn.LayerNorm(cond_dim)
self.norm_mid_cond = nn.LayerNorm(cond_dim)
@@ -1387,7 +1390,8 @@ class Unet(nn.Module):
self.text_to_cond = None
if cond_on_text_encodings:
-self.text_to_cond = nn.LazyLinear(cond_dim) if not exists(text_embed_dim) else nn.Linear(text_embed_dim, cond_dim)
assert exists(text_embed_dim), 'text_embed_dim must be given to the unet if cond_on_text_encodings is True'
self.text_to_cond = nn.Linear(text_embed_dim, cond_dim)
# finer control over whether to condition on image embeddings and text encodings
# so one can have the latter unets in the cascading DDPMs only focus on super-resoluting
@@ -1701,7 +1705,7 @@ class Decoder(BaseGaussianDiffusion):
self.unconditional = unconditional
assert not (condition_on_text_encodings and unconditional), 'unconditional decoder image generation cannot be set to True if conditioning on text is present'
-assert exists(clip) ^ exists(image_size), 'either CLIP is supplied, or you must give the image_size and channels (usually 3 for RGB)'
assert self.unconditional or (exists(clip) ^ exists(image_size)), 'either CLIP is supplied, or you must give the image_size and channels (usually 3 for RGB)'
self.clip = None
if exists(clip):
@@ -1792,6 +1796,7 @@ class Decoder(BaseGaussianDiffusion):
self.image_cond_drop_prob = image_cond_drop_prob
self.text_cond_drop_prob = text_cond_drop_prob
self.can_classifier_guidance = image_cond_drop_prob > 0. or text_cond_drop_prob > 0.
# whether to clip when sampling
@@ -1818,6 +1823,8 @@ class Decoder(BaseGaussianDiffusion):
unet.cpu()
def p_mean_variance(self, unet, x, t, image_embed, text_encodings = None, text_mask = None, lowres_cond_img = None, clip_denoised = True, predict_x_start = False, learned_variance = False, cond_scale = 1., model_output = None):
assert not (cond_scale != 1. and not self.can_classifier_guidance), 'the decoder was not trained with conditional dropout, and thus one cannot use classifier free guidance (cond_scale anything other than 1)'
pred = default(model_output, lambda: unet.forward_with_cond_scale(x, t, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, cond_scale = cond_scale, lowres_cond_img = lowres_cond_img))
if learned_variance:
@@ -2036,12 +2043,12 @@ class Decoder(BaseGaussianDiffusion):
times = torch.randint(0, self.num_timesteps, (b,), device = device, dtype = torch.long)
-if not exists(image_embed):
if not exists(image_embed) and not self.unconditional:
assert exists(self.clip), 'if you want to derive CLIP image embeddings automatically, you must supply `clip` to the decoder on init'
image_embed, _ = self.clip.embed_image(image)
text_encodings = text_mask = None
-if exists(text) and not exists(text_encodings):
if exists(text) and not exists(text_encodings) and not self.unconditional:
assert exists(self.clip), 'if you are passing in raw text, you need to supply `clip` to the decoder'
_, text_encodings, text_mask = self.clip.embed_text(text)
@@ -2093,6 +2100,7 @@ class DALLE2(nn.Module):
self,
text,
cond_scale = 1.,
prior_cond_scale = 1.,
return_pil_images = False
):
device = next(self.parameters()).device
@@ -2102,7 +2110,7 @@ class DALLE2(nn.Module):
text = [text] if not isinstance(text, (list, tuple)) else text
text = tokenizer.tokenize(text).to(device)
-image_embed = self.prior.sample(text, num_samples_per_batch = self.prior_num_samples)
image_embed = self.prior.sample(text, num_samples_per_batch = self.prior_num_samples, cond_scale = prior_cond_scale)
text_cond = text if self.decoder_need_text_cond else None
images = self.decoder.sample(image_embed, text = text_cond, cond_scale = cond_scale)
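These changes thread classifier-free guidance through both stages: each model records whether it was trained with conditional dropout (`can_classifier_guidance`), and `DALLE2.forward` gains a separate `prior_cond_scale`. A minimal usage sketch, assuming a `prior` and `decoder` already constructed with non-zero conditional dropout as in the README examples:

```python
from dalle2_pytorch import DALLE2

# `prior` and `decoder` are assumed to have been built with conditional dropout
# (cond_drop_prob / image_cond_drop_prob / text_cond_drop_prob > 0.), otherwise
# the new asserts reject any cond_scale other than 1.
dalle2 = DALLE2(
    prior = prior,
    decoder = decoder
)

images = dalle2(
    ['a butterfly trying to escape a tornado'],
    cond_scale = 2.,        # guidance scale for the decoder unets
    prior_cond_scale = 1.   # separate guidance scale for the diffusion prior
)
```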

View File

@@ -335,11 +335,6 @@ class DecoderTrainer(nn.Module):
self.num_unets = len(self.decoder.unets)
self.use_ema = use_ema
-if use_ema:
-has_lazy_linear = any([type(module) == nn.LazyLinear for module in decoder.modules()])
-assert not has_lazy_linear, 'you must set the text_embed_dim on your u-nets if you plan on doing automatic exponential moving average'
self.ema_unets = nn.ModuleList([])
self.amp = amp

View File

@@ -10,7 +10,7 @@ setup(
'dream = dalle2_pytorch.cli:dream'
],
},
-version = '0.2.34',
version = '0.2.38',
license='MIT',
description = 'DALL-E 2',
author = 'Phil Wang',