Mirror of https://github.com/lucidrains/DALLE2-pytorch.git, synced 2026-02-12 11:34:29 +01:00
Compare commits
15 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 58d9b422f3 | |
| | 44b319cb57 | |
| | c30f380689 | |
| | e4e884bb8b | |
| | 803ad9c17d | |
| | a88dd6a9c0 | |
| | 72c16b496e | |
| | 81d83dd7f2 | |
| | fa66f7e1e9 | |
| | aa8d135245 | |
| | 70282de23b | |
| | 83f761847e | |
| | 11469dc0c6 | |
| | 2d25c89f35 | |
| | 3fe96c208a | |
README.md (17 changed lines)
@@ -821,7 +821,7 @@ Once built, images will be saved to the same directory the command is invoked
 - [x] just take care of the training for the decoder in a wrapper class, as each unet in the cascade will need its own optimizer
 - [x] bring in tools to train vqgan-vae
 - [x] add convnext backbone for vqgan-vae (in addition to vit [vit-vqgan] + resnet)
-- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet
+- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo)
 - [ ] copy the cascading ddpm code to a separate repo (perhaps https://github.com/lucidrains/denoising-diffusion-pytorch) as the main contribution of dalle2 really is just the prior network
 - [ ] transcribe code to Jax, which lowers the activation energy for distributed training, given access to TPUs
 - [ ] pull logic for training diffusion prior into a class DiffusionPriorTrainer, for eventual script based + CLI based training
@@ -831,6 +831,11 @@ Once built, images will be saved to the same directory the command is invoked
 - [ ] bring in cross-scale embedding from iclr paper https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/crossformer.py#L14
 - [ ] figure out if possible to augment with external memory, as described in https://arxiv.org/abs/2204.11824
 - [ ] test out grid attention in cascading ddpm locally, decide whether to keep or remove
+- [ ] use an experimental tracker agnostic setup, as done <a href="https://github.com/lucidrains/tf-bind-transformer#simple-trainer-class-for-fine-tuning">here</a>
+- [ ] make sure for the latter unets in the cascade, one can train on crops for learning super resolution (constrain the unet to be only convolutions in that case, or allow conv-like attention with rel pos bias)
+- [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2
+- [ ] make sure FILIP works with DALL-E2 from x-clip https://arxiv.org/abs/2111.07783
+- [ ] make sure DDPMs can be run with traditional resnet blocks (but leave convnext as an option for experimentation)

 ## Citations

@@ -896,4 +901,14 @@ Once built, images will be saved to the same directory the command is invoked
 }
 ```

+```bibtex
+@article{Shleifer2021NormFormerIT,
+    title = {NormFormer: Improved Transformer Pretraining with Extra Normalization},
+    author = {Sam Shleifer and Jason Weston and Myle Ott},
+    journal = {ArXiv},
+    year = {2021},
+    volume = {abs/2110.09456}
+}
+```
+
 *Creating noise from data is easy; creating data from noise is generative modeling.* - Yang Song's <a href="https://arxiv.org/abs/2011.13456">paper</a>
@@ -1,6 +1,7 @@
import click
import torch
import torchvision.transforms as T
from functools import reduce
from pathlib import Path

from dalle2_pytorch import DALLE2, Decoder, DiffusionPrior
@@ -499,7 +499,12 @@ class SwiGLU(nn.Module):
         x, gate = x.chunk(2, dim = -1)
         return x * F.silu(gate)

-def FeedForward(dim, mult = 4, dropout = 0., post_activation_norm = False):
+def FeedForward(
+    dim,
+    mult = 4,
+    dropout = 0.,
+    post_activation_norm = False
+):
     """ post-activation norm https://arxiv.org/abs/2110.09456 """

     inner_dim = int(mult * dim)
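The hunk above only reshapes the `FeedForward` signature; the substantive option is the `post_activation_norm` flag, i.e. the NormFormer idea of an extra LayerNorm placed right after the activation. Below is a minimal, runnable sketch of how such a flag can be wired, using `nn.LayerNorm` and a local `SwiGLU` as stand-ins for the repo's own modules (the full body of the repo's `FeedForward` is not part of this diff, so the exact layer ordering here is an assumption):

```python
import torch
import torch.nn.functional as F
from torch import nn

class SwiGLU(nn.Module):
    # gated activation, as in the hunk above: split features in two,
    # gate one half with SiLU of the other
    def forward(self, x):
        x, gate = x.chunk(2, dim = -1)
        return x * F.silu(gate)

def FeedForward(dim, mult = 4, dropout = 0., post_activation_norm = False):
    # post_activation_norm = True adds the NormFormer LayerNorm
    # right after the activation (https://arxiv.org/abs/2110.09456)
    inner_dim = int(mult * dim)
    return nn.Sequential(
        nn.LayerNorm(dim),
        nn.Linear(dim, inner_dim * 2, bias = False),  # * 2 because SwiGLU halves the width
        SwiGLU(),
        nn.LayerNorm(inner_dim) if post_activation_norm else nn.Identity(),
        nn.Dropout(dropout),
        nn.Linear(inner_dim, dim, bias = False)
    )

x = torch.randn(2, 16, 512)
print(FeedForward(512, post_activation_norm = True)(x).shape)  # torch.Size([2, 16, 512])
```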
@@ -522,7 +527,8 @@ class Attention(nn.Module):
         dim_head = 64,
         heads = 8,
         dropout = 0.,
-        causal = False
+        causal = False,
+        post_norm = False
     ):
         super().__init__()
         self.scale = dim_head ** -0.5
@@ -537,7 +543,11 @@ class Attention(nn.Module):
         self.null_kv = nn.Parameter(torch.randn(2, dim_head))
         self.to_q = nn.Linear(dim, inner_dim, bias = False)
         self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)
-        self.to_out = nn.Linear(inner_dim, dim, bias = False)
+
+        self.to_out = nn.Sequential(
+            nn.Linear(inner_dim, dim, bias = False),
+            LayerNorm(dim) if post_norm else nn.Identity()
+        )

     def forward(self, x, mask = None, attn_bias = None):
         b, n, device = *x.shape[:2], x.device
@@ -599,10 +609,11 @@ class CausalTransformer(nn.Module):
         dim_head = 64,
         heads = 8,
         ff_mult = 4,
-        norm_out = False,
+        norm_out = True,
         attn_dropout = 0.,
         ff_dropout = 0.,
-        final_proj = True
+        final_proj = True,
+        normformer = False
     ):
         super().__init__()
         self.rel_pos_bias = RelPosBias(heads = heads)
@@ -610,8 +621,8 @@
         self.layers = nn.ModuleList([])
         for _ in range(depth):
             self.layers.append(nn.ModuleList([
-                Attention(dim = dim, causal = True, dim_head = dim_head, heads = heads, dropout = attn_dropout),
-                FeedForward(dim = dim, mult = ff_mult, dropout = ff_dropout)
+                Attention(dim = dim, causal = True, dim_head = dim_head, heads = heads, dropout = attn_dropout, post_norm = normformer),
+                FeedForward(dim = dim, mult = ff_mult, dropout = ff_dropout, post_activation_norm = normformer)
             ]))

         self.norm = LayerNorm(dim) if norm_out else nn.Identity() # unclear in paper whether they projected after the classic layer norm for the final denoised image embedding, or just had the transformer output it directly: plan on offering both options
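Taken together, the `CausalTransformer` hunks route a single `normformer` constructor flag into every layer's `post_norm` / `post_activation_norm` argument, and flip the default of `norm_out` to `True`. A small runnable sketch of that wiring pattern, with a generic post-normed block standing in for the repo's `Attention` and `FeedForward` (whose full definitions are not shown in this diff):

```python
import torch
from torch import nn

def Block(dim, post_norm = False):
    # stand-in for Attention / FeedForward: a projection followed by an
    # optional extra LayerNorm, toggled per layer
    return nn.Sequential(
        nn.Linear(dim, dim, bias = False),
        nn.LayerNorm(dim) if post_norm else nn.Identity()
    )

class TinyCausalTransformer(nn.Module):
    def __init__(self, dim, depth, norm_out = True, normformer = False):
        super().__init__()
        # one `normformer` switch fans out to every layer's post-norm flag
        self.layers = nn.ModuleList([
            nn.ModuleList([
                Block(dim, post_norm = normformer),  # attention stand-in
                Block(dim, post_norm = normformer)   # feedforward stand-in
            ]) for _ in range(depth)
        ])
        # norm_out = True normalizes the final output (the new default above)
        self.norm = nn.LayerNorm(dim) if norm_out else nn.Identity()

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return self.norm(x)

x = torch.randn(2, 8, 64)
print(TinyCausalTransformer(64, depth = 2, normformer = True)(x).shape)  # torch.Size([2, 8, 64])
```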
setup.py (2 changed lines)
@@ -10,7 +10,7 @@ setup(
       'dream = dalle2_pytorch.cli:dream'
     ],
   },
-  version = '0.0.91',
+  version = '0.0.94',
   license='MIT',
   description = 'DALL-E 2',
   author = 'Phil Wang',
@@ -7,6 +7,9 @@ from torch import nn
from embedding_reader import EmbeddingReader
from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork
from dalle2_pytorch.optimizer import get_optimizer
from torch.cuda.amp import autocast,GradScaler

import time
from tqdm import tqdm
@@ -53,6 +56,8 @@ def train(image_embed_dim,
           clip,
           dp_condition_on_text_encodings,
           dp_timesteps,
+          dp_l2norm_output,
+          dp_normformer,
           dp_cond_drop_prob,
           dpn_depth,
           dpn_dim_head,
@@ -70,7 +75,9 @@ def train(image_embed_dim,
         dim = image_embed_dim,
         depth = dpn_depth,
         dim_head = dpn_dim_head,
-        heads = dpn_heads).to(device)
+        heads = dpn_heads,
+        normformer = dp_normformer,
+        l2norm_output = dp_l2norm_output).to(device)

     # DiffusionPrior with text embeddings and image embeddings pre-computed
     diffusion_prior = DiffusionPrior(
@@ -132,7 +139,7 @@ def train(image_embed_dim,
                 "Samples per second": samples_per_sec})

             scaler.unscale_(optimizer)
-            nn.init.clip_grad_norm_(diffusion_prior.parameters(), max_grad_norm)
+            nn.utils.clip_grad_norm_(diffusion_prior.parameters(), max_grad_norm)

             scaler.step(optimizer)
             scaler.update()
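The one-line fix above matters because `clip_grad_norm_` lives in `torch.nn.utils`, not `torch.nn.init`, so the old call would fail with an `AttributeError` as soon as gradient clipping ran. For context, here is a minimal, self-contained sketch of the standard `torch.cuda.amp` step the surrounding code follows; the toy model, toy loss, and hyperparameter values (taken from the CLI defaults below) stand in for the actual diffusion prior training loop, and the example falls back to an unscaled path on CPU:

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

device = 'cuda' if torch.cuda.is_available() else 'cpu'
use_amp = device == 'cuda'

model = nn.Linear(512, 512).to(device)  # toy stand-in for the diffusion prior
optimizer = torch.optim.AdamW(model.parameters(), lr = 1.1e-4, weight_decay = 6.02e-2)
scaler = GradScaler(enabled = use_amp)
max_grad_norm = 0.5

for _ in range(3):
    batch = torch.randn(64, 512, device = device)
    with autocast(enabled = use_amp):
        loss = model(batch).pow(2).mean()  # toy loss
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # bring grads back to true scale before clipping
    nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # correct module: nn.utils
    scaler.step(optimizer)
    scaler.update()
```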
@@ -161,8 +168,8 @@ def main():
     parser.add_argument("--image-embed-url", type=str, default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/")
     parser.add_argument("--text-embed-url", type=str, default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/")
     # Hyperparameters
-    parser.add_argument("--learning-rate", type=float, default=0.001)
-    parser.add_argument("--weight-decay", type=float, default=0.01)
+    parser.add_argument("--learning-rate", type=float, default=1.1e-4)
+    parser.add_argument("--weight-decay", type=float, default=6.02e-2)
     parser.add_argument("--max-grad-norm", type=float, default=0.5)
     parser.add_argument("--batch-size", type=int, default=10**4)
     parser.add_argument("--num-epochs", type=int, default=5)
@@ -180,7 +187,9 @@ def main():
     # DiffusionPrior(dp) parameters
     parser.add_argument("--dp-condition-on-text-encodings", type=bool, default=False)
     parser.add_argument("--dp-timesteps", type=int, default=100)
-    parser.add_argument("--dp-cond-drop-prob", type=float, default=0.2)
+    parser.add_argument("--dp-l2norm-output", type=bool, default=False)
+    parser.add_argument("--dp-normformer", type=bool, default=False)
+    parser.add_argument("--dp-cond-drop-prob", type=float, default=0.1)
     parser.add_argument("--dp-loss-type", type=str, default="l2")
     parser.add_argument("--clip", type=str, default=None)
     parser.add_argument("--amp", type=bool, default=False)
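One thing worth knowing about the two new boolean options (`--dp-l2norm-output`, `--dp-normformer`) is that argparse's `type=bool` simply calls `bool()` on the raw string, so any non-empty value, including the string "False", parses as `True`. A quick self-contained check of that behavior (generic argparse semantics, not something shown in the diff itself):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dp-l2norm-output", type=bool, default=False)
parser.add_argument("--dp-normformer", type=bool, default=False)

print(parser.parse_args([]).dp_normformer)                             # False (default)
print(parser.parse_args(["--dp-normformer", "1"]).dp_normformer)       # True
print(parser.parse_args(["--dp-normformer", "False"]).dp_normformer)   # True, since bool("False") is True
```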
@@ -223,6 +232,8 @@ def main():
         args.clip,
         args.dp_condition_on_text_encodings,
         args.dp_timesteps,
+        args.dp_l2norm_output,
+        args.dp_normformer,
         args.dp_cond_drop_prob,
         args.dpn_depth,
         args.dpn_dim_head,