complete imagen-like noise level conditioning

8 changed files with 351 additions and 128 deletions
--- a/.github/FUNDING.yml
+++ b/.github/FUNDING.yml
@@ -1 +1 @@
-github: [lucidrains]
+github: [nousr, Veldrovive, lucidrains]
--- a/README.md
+++ b/README.md
@@ -45,6 +45,7 @@ This library would not have gotten to this working state without the help of
 - <a href="https://github.com/rom1504">Romain</a> for the pull request reviews and project management
 - <a href="https://github.com/Ciaohe">He Cao</a> and <a href="https://github.com/xiankgx">xiankgx</a> for the Q&A and for identifying of critical bugs
 - <a href="https://github.com/marunine">Marunine</a> for identifying issues with resizing of the low resolution conditioner, when training the upsampler, in addition to various other bug fixes
+- <a href="https://github.com/malumadev">MalumaDev</a> for proposing the use of pixel shuffle upsampler for fixing checkboard artifacts
 - <a href="https://github.com/crowsonkb">Katherine</a> for her advice
 - <a href="https://stability.ai/">Stability AI</a> for the generous sponsorship
 - <a href="https://huggingface.co">🤗 Huggingface</a> and in particular <a href="https://github.com/sgugger">Sylvain</a> for the <a href="https://github.com/huggingface/accelerate">Accelerate</a> library
@@ -355,7 +356,8 @@ prior_network = DiffusionPriorNetwork(
 diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
-    timesteps = 100,
+    timesteps = 1000,
+    sample_timesteps = 64,
    cond_drop_prob = 0.2
 ).cuda()

@@ -419,7 +421,7 @@ For the layperson, no worries, training will all be automated into a CLI tool, a

 ## Training on Preprocessed CLIP Embeddings

-It is likely, when scaling up, that you would first preprocess your images and text into corresponding embeddings before training the prior network. You can do so easily by simply passing in `image_embed`, `text_embed`, and optionally `text_encodings` and `text_mask`
+It is likely, when scaling up, that you would first preprocess your images and text into corresponding embeddings before training the prior network. You can do so easily by simply passing in `image_embed`, `text_embed`, and optionally `text_encodings`

 Working example below

@@ -1046,11 +1048,10 @@ Once built, images will be saved to the same directory the command is invoked
 - [x] bring in skip-layer excitations (from lightweight gan paper) to see if it helps for either decoder of unet or vqgan-vae training (doesnt work well)
 - [x] test out grid attention in cascading ddpm locally, decide whether to keep or remove https://arxiv.org/abs/2204.01697 (keeping, seems to be fine)
 - [x] allow for unet to be able to condition non-cross attention style as well
- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo) - consider https://github.com/lucidrains/uformer-pytorch attention-based unet
- [ ] speed up inference, read up on papers (ddim or diffusion-gan, etc)
- [ ] figure out if possible to augment with external memory, as described in https://arxiv.org/abs/2204.11824
- [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2
+- [x] speed up inference, read up on papers (ddim)
 - [ ] add inpainting ability using resampler from repaint paper https://arxiv.org/abs/2201.09865
+- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo) - consider https://github.com/lucidrains/uformer-pytorch attention-based unet
+- [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2

 ## Citations

--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
--- a/dalle2_pytorch/train_configs.py
+++ b/dalle2_pytorch/train_configs.py
@@ -129,6 +129,7 @@ class AdapterConfig(BaseModel):
 class DiffusionPriorNetworkConfig(BaseModel):
    dim: int
    depth: int
+    max_text_len: int = None
    num_timesteps: int = None
    num_time_embeds: int = 1
    num_image_embeds: int = 1
@@ -136,6 +137,7 @@ class DiffusionPriorNetworkConfig(BaseModel):
    dim_head: int = 64
    heads: int = 8
    ff_mult: int = 4
+    norm_in: bool = False
    norm_out: bool = True
    attn_dropout: float = 0.
    ff_dropout: float = 0.
@@ -223,6 +225,7 @@ class UnetConfig(BaseModel):
    self_attn: ListOrTuple(int)
    attn_dim_head: int = 32
    attn_heads: int = 16
+    init_cross_embed: bool = True

    class Config:
        extra = "allow"
--- a/dalle2_pytorch/trainer.py
+++ b/dalle2_pytorch/trainer.py
@@ -673,8 +673,14 @@ class DecoderTrainer(nn.Module):
    def sample(self, *args, **kwargs):
        distributed = self.accelerator.num_processes > 1
        base_decoder = self.accelerator.unwrap_model(self.decoder)
+
+        was_training = base_decoder.training
+        base_decoder.eval()
+
        if kwargs.pop('use_non_ema', False) or not self.use_ema:
-            return base_decoder.sample(*args, **kwargs, distributed = distributed)
+            out = base_decoder.sample(*args, **kwargs, distributed = distributed)
+            base_decoder.train(was_training)
+            return out

        trainable_unets = self.accelerator.unwrap_model(self.decoder).unets
        base_decoder.unets = self.unets                  # swap in exponential moving averaged unets for sampling
@@ -687,6 +693,7 @@ class DecoderTrainer(nn.Module):
        for ema in self.ema_unets:
            ema.restore_ema_model_device()

+        base_decoder.train(was_training)
        return output

    @torch.no_grad()
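
Review note on the `DecoderTrainer.sample` change above: the decoder is switched to eval mode and its previous mode is restored on each return path. A minimal sketch of the same idea as a context manager follows, so that the restore also happens if sampling raises. The names `eval_mode` and `FakeModule` are hypothetical and not part of the library; `FakeModule` only mimics the `training`, `eval()` and `train(mode)` interface of `torch.nn.Module` so the sketch runs without PyTorch.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(module):
    """Temporarily put `module` in eval mode, then restore its previous mode."""
    was_training = module.training
    module.eval()
    try:
        yield module
    finally:
        # runs on normal exit and on exceptions alike
        module.train(was_training)


class FakeModule:
    """Hypothetical stand-in exposing the nn.Module train/eval interface."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


decoder = FakeModule()
with eval_mode(decoder):
    mode_while_sampling = decoder.training  # eval mode inside the block
mode_after_sampling = decoder.training      # previous mode restored afterwards
```

With this shape, both the non-EMA and the EMA sampling paths could share one `with` block instead of repeating the restore call before each `return`. Whether that fits the trainer is a judgment call for the maintainers.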
--- a/dalle2_pytorch/version.py
+++ b/dalle2_pytorch/version.py
@@ -1 +1 @@
-__version__ = '0.19.4'
+__version__ = '0.25.0'
--- a/train_decoder.py
+++ b/train_decoder.py
@@ -323,7 +323,7 @@ def train(
        last_snapshot = sample

        if next_task == 'train':
-            for i, (img, emb, txt) in enumerate(trainer.train_loader):
+            for i, (img, emb, txt) in enumerate(dataloaders["train"]):
                # We want to count the total number of samples across all processes
                sample_length_tensor[0] = len(img)
                all_samples = accelerator.gather(sample_length_tensor)  # TODO: accelerator.reduce is broken when this was written. If it is fixed replace this.
@@ -358,6 +358,7 @@ def train(
                        else:
                            # Then we need to pass the text instead
                            tokenized_texts = tokenize(txt, truncate=True)
+                            assert tokenized_texts.shape[0] == len(img), f"The number of texts ({tokenized_texts.shape[0]}) should be the same as the number of images ({len(img)})"
                            forward_params['text'] = tokenized_texts
                    loss = trainer.forward(img, **forward_params, unet_number=unet)
                    trainer.update(unet_number=unet)
@@ -416,7 +417,7 @@ def train(
            timer = Timer()
            accelerator.wait_for_everyone()
            i = 0
-            for i, (img, emb, txt) in enumerate(trainer.val_loader):  # Use the accelerate prepared loader
+            for i, (img, emb, txt) in enumerate(dataloaders['val']):  # Use the accelerate prepared loader
                val_sample_length_tensor[0] = len(img)
                all_samples = accelerator.gather(val_sample_length_tensor)
                total_samples = all_samples.sum().item()
@@ -557,7 +558,7 @@ def initialize_training(config: TrainDecoderConfig, config_path):

    # Create the decoder model and print basic info
    decoder = config.decoder.create()
-    num_parameters = sum(p.numel() for p in decoder.parameters())
+    get_num_parameters = lambda model, only_training=False: sum(p.numel() for p in model.parameters() if (p.requires_grad or not only_training))

    # Create and initialize the tracker if we are the master
    tracker = create_tracker(accelerator, config, config_path, dummy = rank!=0)
@@ -586,7 +587,10 @@ def initialize_training(config: TrainDecoderConfig, config_path):
    accelerator.print(print_ribbon("Loaded Config", repeat=40))
    accelerator.print(f"Running training with {accelerator.num_processes} processes and {accelerator.distributed_type} distributed training")
    accelerator.print(f"Training using {data_source_string}. {'conditioned on text' if conditioning_on_text else 'not conditioned on text'}")
-    accelerator.print(f"Number of parameters: {num_parameters}")
+    accelerator.print(f"Number of parameters: {get_num_parameters(decoder)} total; {get_num_parameters(decoder, only_training=True)} training")
+    for i, unet in enumerate(decoder.unets):
+        accelerator.print(f"Unet {i} has {get_num_parameters(unet)} total; {get_num_parameters(unet, only_training=True)} training")
+
    train(dataloaders, decoder, accelerator,
        tracker=tracker,
        inference_device=accelerator.device,
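
Review note on the `get_num_parameters` lambda above: it counts every parameter by default and only those with `requires_grad` set when `only_training=True`. A minimal sketch of the same logic as a named function follows. `count_parameters`, `fake_param` and `FakeModel` are hypothetical names used for illustration; the stand-ins only provide `parameters()`, `numel()` and `requires_grad` so the sketch runs without PyTorch.

```python
from types import SimpleNamespace


def count_parameters(model, only_trainable=False):
    """Sum element counts over model.parameters(), optionally skipping frozen ones."""
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not only_trainable
    )


def fake_param(n, requires_grad):
    # hypothetical stand-in for a tensor parameter holding n elements
    return SimpleNamespace(numel=lambda: n, requires_grad=requires_grad)


class FakeModel:
    """Hypothetical stand-in exposing only parameters()."""

    def __init__(self, params):
        self._params = params

    def parameters(self):
        return iter(self._params)


model = FakeModel([fake_param(10, True), fake_param(5, False)])
total = count_parameters(model)
trainable = count_parameters(model, only_trainable=True)
print(f"Number of parameters: {total} total; {trainable} training")
```

A named function gets a docstring and a readable traceback, which a lambda bound to a variable does not; the behavior is otherwise the same as in the diff.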
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -126,9 +126,9 @@ def report_cosine_sims(

        # we are text conditioned, we produce an embedding from the tokenized text
        if text_conditioned:
-            text_embedding, text_encodings, text_mask = trainer.embed_text(text_data)
+            text_embedding, text_encodings = trainer.embed_text(text_data)
            text_cond = dict(
-                text_embed=text_embedding, text_encodings=text_encodings, mask=text_mask
+                text_embed=text_embedding, text_encodings=text_encodings
            )
        else:
            text_embedding = text_data
@@ -146,15 +146,12 @@ def report_cosine_sims(

        if text_conditioned:
            text_encodings_shuffled = text_encodings[rolled_idx]
-            text_mask_shuffled = text_mask[rolled_idx]
        else:
            text_encodings_shuffled = None
-            text_mask_shuffled = None

        text_cond_shuffled = dict(
            text_embed=text_embed_shuffled,
-            text_encodings=text_encodings_shuffled,
-            mask=text_mask_shuffled,
+            text_encodings=text_encodings_shuffled
        )

        # prepare the text embedding
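
Review note on the `text_mask` removal above: according to the commit log below, the mask is now derived from the text encodings themselves, which requires padded positions to be exactly 0. A minimal sketch of that derivation on nested Python lists follows. `derive_text_mask` is a hypothetical name, and the library operates on tensors rather than lists; with PyTorch the same idea would presumably be a single reduction such as `torch.any(text_encodings != 0., dim=-1)`.

```python
def derive_text_mask(text_encodings):
    """Return True for token positions whose encoding has any non-zero value.

    `text_encodings` is a batch of sequences, each a list of per-token vectors.
    """
    return [
        [any(value != 0.0 for value in token) for token in sequence]
        for sequence in text_encodings
    ]


# two sequences of three tokens each, padded with all-zero vectors
batch = [
    [[0.3, -1.2], [0.7, 0.1], [0.0, 0.0]],
    [[0.5, 0.0], [0.0, 0.0], [0.0, 0.0]],
]
mask = derive_text_mask(batch)
```

One consequence of this scheme is that a real token whose encoding happens to be all zeros would be treated as padding, so callers who preprocess embeddings need to pad with exact zeros and nothing else.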
Author	SHA1	Message	Date
Phil Wang	6afb886cf4	complete imagen-like noise level conditioning	2022-07-18 13:43:57 -07:00
Phil Wang	c7fe4f2f44	project management	2022-07-17 17:27:44 -07:00
Phil Wang	a2ee3fa3cc	offer way to turn off initial cross embed convolutional module, for debugging upsampler artifacts	2022-07-15 17:29:10 -07:00
Phil Wang	a58a370d75	takes care of a grad strides error at https://github.com/lucidrains/DALLE2-pytorch/issues/196 thanks to @YUHANG-Ma	2022-07-14 15:28:34 -07:00
Phil Wang	1662bbf226	protect against random cropping for base unet	2022-07-14 12:49:43 -07:00
Phil Wang	5be1f57448	update	2022-07-14 12:03:42 -07:00
Phil Wang	c52ce58e10	update	2022-07-14 10:54:51 -07:00
Phil Wang	a34f60962a	let the neural network peek at the low resolution conditioning one last time before making prediction, for upsamplers	2022-07-14 10:27:04 -07:00
Phil Wang	0b40cbaa54	just always use nearest neighbor interpolation when resizing for low resolution conditioning, for https://github.com/lucidrains/DALLE2-pytorch/pull/181	2022-07-13 20:59:43 -07:00
Phil Wang	f141144a6d	allow for using classifier free guidance for some unets but not others, by passing in a tuple of cond_scale during sampling for decoder, just in case it is causing issues for upsamplers	2022-07-13 13:12:30 -07:00
Phil Wang	f988207718	hack around some inplace error, also make sure for openai clip text encoding, only tokens after eos_id is masked out	2022-07-13 12:56:02 -07:00
Phil Wang	b2073219f0	foolproof sampling for decoder to always use eval mode (and restore training state afterwards)	2022-07-13 10:21:00 -07:00
Phil Wang	cc0f7a935c	fix non pixel shuffle upsample	2022-07-13 10:16:02 -07:00
Phil Wang	95a512cb65	fix a potential bug with conditioning with blurred low resolution image, blur should be applied only 50% of the time	2022-07-13 10:11:49 -07:00
Phil Wang	972ee973bc	fix issue with ddim and normalization of lowres conditioning image	2022-07-13 09:48:40 -07:00
Phil Wang	79e2a3bc77	only use the stable layernorm for final output norm in transformer	2022-07-13 07:56:30 -07:00
Aidan Dempster	544cdd0b29	Reverted to using basic dataloaders (#205) Accelerate removes the ability to collate strings. Likely since it cannot gather strings.	2022-07-12 18:22:27 -07:00
Phil Wang	349aaca56f	add yet another transformer stability measure	2022-07-12 17:49:16 -07:00
Phil Wang	3ee3c56d2a	add learned padding tokens, same strategy as dalle1, for diffusion prior, and get rid of masking in causal transformer	2022-07-12 17:33:14 -07:00
Phil Wang	cd26c6b17d	0.22.3	2022-07-12 17:08:31 -07:00
Phil Wang	775abc4df6	add setting to attend to all text encodings regardless of padding, for diffusion prior	2022-07-12 17:08:12 -07:00
Phil Wang	11b1d533a0	make sure text encodings being passed in has the correct batch dimension	2022-07-12 16:00:19 -07:00
Phil Wang	e76e89f9eb	remove text masking altogether in favor of deriving from text encodings (padded text encodings must be pad value of 0.)	2022-07-12 15:40:31 -07:00
Phil Wang	bb3ff0ac67	protect against bad text mask being passed into decoder	2022-07-12 15:33:13 -07:00
Phil Wang	1ec4dbe64f	one more fix for text mask, if the length of the text encoding exceeds max_text_len, add an assert for better error msg	2022-07-12 15:01:46 -07:00
Phil Wang	e0835acca9	generate text mask within the unet and diffusion prior itself from the text encodings, if not given	2022-07-12 12:54:59 -07:00
Phil Wang	e055793e5d	shoutout for @MalumaDev	2022-07-11 16:12:35 -07:00
Phil Wang	1d9ef99288	add PixelShuffleUpsample thanks to @MalumaDev and @marunine for running the experiment and verifyng absence of checkboard artifacts	2022-07-11 16:07:23 -07:00
Phil Wang	bdd62c24b3	zero init final projection in unet, since openai and @crowsonkb are both doing it	2022-07-11 13:22:06 -07:00
Phil Wang	1f1557c614	make it so even if text mask is omitted, it will be derived based on whether text encodings are all 0s or not, simplify dataloading	2022-07-11 10:56:19 -07:00
Aidan Dempster	1a217e99e3	Unet parameter count is now shown (#202)	2022-07-10 16:45:59 -07:00
Phil Wang	7ea314e2f0	allow for final l2norm clamping of the sampled image embed	2022-07-10 09:44:38 -07:00
Phil Wang	4173e88121	more accurate readme	2022-07-09 20:57:26 -07:00
Phil Wang	3dae43fa0e	fix misnamed variable, thanks to @nousr	2022-07-09 19:01:37 -07:00