turn off classifier free guidance if predicting x_start for diffusion prior

todo
2026-02-12 19:44:26 +01:00 · 2022-05-07 09:38:17 -07:00 · 2022-05-07 09:21:08 -07:00 · 2022-05-07 08:33:45 -07:00 · 2022-05-07 08:32:43 -07:00
3 changed files with 6 additions and 4 deletions
--- a/README.md
+++ b/README.md
@@ -981,6 +981,8 @@ Once built, images will be saved to the same directory the command is invoked
 - [ ] make sure FILIP works with DALL-E2 from x-clip https://arxiv.org/abs/2111.07783
 - [ ] make sure resnet hyperparameters can be configurable across unet depth (groups and expansion factor)
 - [ ] offer save / load methods on the trainer classes to automatically take care of state dicts for scalers / optimizers / saving versions and checking for breaking changes
+- [ ] offer setting in diffusion prior to split time and image embeddings into multiple tokens, configurable, for more surface area during attention
+- [ ] bring in skip-layer excitatons (from lightweight gan paper) to see if it helps for either decoder of unet or vqgan-vae training

 ## Citations

--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
@@ -706,7 +706,7 @@ class DiffusionPriorNetwork(nn.Module):
        **kwargs
    ):
        super().__init__()
-        self.time_embeddings = nn.Embedding(num_timesteps, dim) if exists(num_timesteps) else nn.Sequential(Rearrange('b -> b 1'), MLP(1, dim)) # also offer a continuous version of timestep embeddings, with a 2 layer MLP
+        self.time_embeddings = nn.Embedding(num_timesteps, dim) if exists(num_timesteps) else nn.Sequential(SinusoidalPosEmb(dim), MLP(dim, dim)) # also offer a continuous version of timestep embeddings, with a 2 layer MLP
        self.learned_query = nn.Parameter(torch.randn(dim))
        self.causal_transformer = CausalTransformer(dim = dim, **kwargs)

@@ -800,7 +800,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
        image_size = None,
        image_channels = 3,
        timesteps = 1000,
-        cond_drop_prob = 0.2,
+        cond_drop_prob = 0.,
        loss_type = "l1",
        predict_x_start = True,
        beta_schedule = "cosine",
@@ -834,7 +834,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
        self.image_embed_dim = default(image_embed_dim, lambda: clip.dim_latent)
        self.channels = default(image_channels, lambda: clip.image_channels)

-        self.cond_drop_prob = cond_drop_prob
+        self.cond_drop_prob = cond_drop_prob if not predict_x_start else 0.
        self.condition_on_text_encodings = condition_on_text_encodings

        # in paper, they do not predict the noise, but predict x0 directly for image embedding, claiming empirically better results. I'll just offer both.
--- a/setup.py
+++ b/setup.py
@@ -10,7 +10,7 @@ setup(
      'dream = dalle2_pytorch.cli:dream'
    ],
  },
-  version = '0.1.7',
+  version = '0.1.9',
  license='MIT',
  description = 'DALL-E 2',
  author = 'Phil Wang',
Author	SHA1	Message	Date
Phil Wang	4010aec033	turn off classifier free guidance if predicting x_start for diffusion prior	2022-05-07 09:38:17 -07:00
Phil Wang	c87b84a259	todo	2022-05-07 09:21:08 -07:00
Phil Wang	8b05468653	todo	2022-05-07 08:33:45 -07:00
Phil Wang	830afd3c15	sinusoidal embed time embeddings for diffusion prior as well, for continuous version	2022-05-07 08:32:43 -07:00