complete inpainting ability using inpaint_image and inpaint_mask passed into sample function for decoder

fix a bug with ddim and predict x0 objective
comments
10 changed files with 667 additions and 157 deletions
--- a/.github/FUNDING.yml
+++ b/.github/FUNDING.yml
@@ -1 +1 @@
-github: [lucidrains]
+github: [nousr, Veldrovive, lucidrains]
--- a/README.md
+++ b/README.md
@@ -45,6 +45,7 @@ This library would not have gotten to this working state without the help of
 - <a href="https://github.com/rom1504">Romain</a> for the pull request reviews and project management
 - <a href="https://github.com/Ciaohe">He Cao</a> and <a href="https://github.com/xiankgx">xiankgx</a> for the Q&A and for identifying of critical bugs
 - <a href="https://github.com/marunine">Marunine</a> for identifying issues with resizing of the low resolution conditioner, when training the upsampler, in addition to various other bug fixes
+- <a href="https://github.com/malumadev">MalumaDev</a> for proposing the use of pixel shuffle upsampler for fixing checkboard artifacts
 - <a href="https://github.com/crowsonkb">Katherine</a> for her advice
 - <a href="https://stability.ai/">Stability AI</a> for the generous sponsorship
 - <a href="https://huggingface.co">🤗 Huggingface</a> and in particular <a href="https://github.com/sgugger">Sylvain</a> for the <a href="https://github.com/huggingface/accelerate">Accelerate</a> library
@@ -355,7 +356,8 @@ prior_network = DiffusionPriorNetwork(
 diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
-    timesteps = 100,
+    timesteps = 1000,
+    sample_timesteps = 64,
    cond_drop_prob = 0.2
 ).cuda()

@@ -419,7 +421,7 @@ For the layperson, no worries, training will all be automated into a CLI tool, a

 ## Training on Preprocessed CLIP Embeddings

-It is likely, when scaling up, that you would first preprocess your images and text into corresponding embeddings before training the prior network. You can do so easily by simply passing in `image_embed`, `text_embed`, and optionally `text_encodings` and `text_mask`
+It is likely, when scaling up, that you would first preprocess your images and text into corresponding embeddings before training the prior network. You can do so easily by simply passing in `image_embed`, `text_embed`, and optionally `text_encodings`

 Working example below

@@ -583,6 +585,7 @@ unet1 = Unet(
    cond_dim = 128,
    channels = 3,
    dim_mults=(1, 2, 4, 8),
+    text_embed_dim = 512,
    cond_on_text_encodings = True  # set to True for any unets that need to be conditioned on text encodings (ex. first unet in cascade)
 ).cuda()

@@ -598,7 +601,8 @@ decoder = Decoder(
    unet = (unet1, unet2),
    image_sizes = (128, 256),
    clip = clip,
-    timesteps = 100,
+    timesteps = 1000,
+    sample_timesteps = (250, 27),
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5
 ).cuda()
@@ -1044,11 +1048,10 @@ Once built, images will be saved to the same directory the command is invoked
 - [x] bring in skip-layer excitations (from lightweight gan paper) to see if it helps for either decoder of unet or vqgan-vae training (doesnt work well)
 - [x] test out grid attention in cascading ddpm locally, decide whether to keep or remove https://arxiv.org/abs/2204.01697 (keeping, seems to be fine)
 - [x] allow for unet to be able to condition non-cross attention style as well
-- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo) - consider https://github.com/lucidrains/uformer-pytorch attention-based unet
-- [ ] speed up inference, read up on papers (ddim or diffusion-gan, etc)
-- [ ] figure out if possible to augment with external memory, as described in https://arxiv.org/abs/2204.11824
+- [x] speed up inference, read up on papers (ddim)
+- [x] add inpainting ability using resampler from repaint paper https://arxiv.org/abs/2201.09865
+- [ ] try out the nested unet from https://arxiv.org/abs/2005.09007 after hearing several positive testimonies from researchers, for segmentation anyhow
 - [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2
-- [ ] add inpainting ability using resampler from repaint paper https://arxiv.org/abs/2201.09865

 ## Citations

--- a/configs/train_decoder_config.test.json
+++ b/configs/train_decoder_config.test.json
@@ -41,7 +41,7 @@
        "resample_train": true,
        "preprocessing": {
            "RandomResizedCrop": {
-                "size": [64, 64],
+                "size": [224, 224],
                "scale": [0.75, 1.0],
                "ratio": [1.0, 1.0]
            },
--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
--- a/dalle2_pytorch/dataloaders/decoder_loader.py
+++ b/dalle2_pytorch/dataloaders/decoder_loader.py
@@ -1,6 +1,7 @@
 import os
 import webdataset as wds
 import torch
+from torch.utils.data import DataLoader
 import numpy as np
 import fsspec
 import shutil
@@ -255,7 +256,7 @@ def create_image_embedding_dataloader(
    )
    if shuffle_num is not None and shuffle_num > 0:
        ds.shuffle(1000)
-    return wds.WebLoader(
+    return DataLoader(
        ds,
        num_workers=num_workers,
        batch_size=batch_size,
--- a/dalle2_pytorch/train_configs.py
+++ b/dalle2_pytorch/train_configs.py
@@ -129,6 +129,7 @@ class AdapterConfig(BaseModel):
 class DiffusionPriorNetworkConfig(BaseModel):
    dim: int
    depth: int
+    max_text_len: int = None
    num_timesteps: int = None
    num_time_embeds: int = 1
    num_image_embeds: int = 1
@@ -136,6 +137,7 @@ class DiffusionPriorNetworkConfig(BaseModel):
    dim_head: int = 64
    heads: int = 8
    ff_mult: int = 4
+    norm_in: bool = False
    norm_out: bool = True
    attn_dropout: float = 0.
    ff_dropout: float = 0.
@@ -154,6 +156,7 @@ class DiffusionPriorConfig(BaseModel):
    image_size: int
    image_channels: int = 3
    timesteps: int = 1000
+    sample_timesteps: Optional[int] = None
    cond_drop_prob: float = 0.
    loss_type: str = 'l2'
    predict_x_start: bool = True
@@ -222,6 +225,7 @@ class UnetConfig(BaseModel):
    self_attn: ListOrTuple(int)
    attn_dim_head: int = 32
    attn_heads: int = 16
+    init_cross_embed: bool = True

    class Config:
        extra = "allow"
@@ -233,6 +237,7 @@ class DecoderConfig(BaseModel):
    clip: Optional[AdapterConfig]   # The clip model to use if embeddings are not provided
    channels: int = 3
    timesteps: int = 1000
+    sample_timesteps: Optional[SingularOrIterable(int)] = None
    loss_type: str = 'l2'
    beta_schedule: ListOrTuple(str) = 'cosine'
    learned_variance: bool = True
--- a/dalle2_pytorch/trainer.py
+++ b/dalle2_pytorch/trainer.py
@@ -21,7 +21,7 @@ import pytorch_warmup as warmup

 from ema_pytorch import EMA

-from accelerate import Accelerator
+from accelerate import Accelerator, DistributedType

 import numpy as np

@@ -76,6 +76,7 @@ def cast_torch_tensor(fn):
    def inner(model, *args, **kwargs):
        device = kwargs.pop('_device', next(model.parameters()).device)
        cast_device = kwargs.pop('_cast_device', True)
+        cast_deepspeed_precision = kwargs.pop('_cast_deepspeed_precision', True)

        kwargs_keys = kwargs.keys()
        all_args = (*args, *kwargs.values())
@@ -85,6 +86,21 @@ def cast_torch_tensor(fn):
        if cast_device:
            all_args = tuple(map(lambda t: t.to(device) if exists(t) and isinstance(t, torch.Tensor) else t, all_args))

+        if cast_deepspeed_precision:
+            try:
+                accelerator = model.accelerator
+                if accelerator is not None and accelerator.distributed_type == DistributedType.DEEPSPEED:
+                    cast_type_map = {
+                        "fp16": torch.half,
+                        "bf16": torch.bfloat16,
+                        "no": torch.float
+                    }
+                    precision_type = cast_type_map[accelerator.mixed_precision]
+                    all_args = tuple(map(lambda t: t.to(precision_type) if exists(t) and isinstance(t, torch.Tensor) else t, all_args))
+            except AttributeError:
+                # Then this model doesn't have an accelerator
+                pass
+
        args, kwargs_values = all_args[:split_kwargs_index], all_args[split_kwargs_index:]
        kwargs = dict(tuple(zip(kwargs_keys, kwargs_values)))

@@ -446,6 +462,7 @@ class DecoderTrainer(nn.Module):
        self,
        decoder,
        accelerator = None,
+        dataloaders = None,
        use_ema = True,
        lr = 1e-4,
        wd = 1e-2,
@@ -508,10 +525,31 @@ class DecoderTrainer(nn.Module):

        self.register_buffer('steps', torch.tensor([0] * self.num_unets))

+        if self.accelerator.distributed_type == DistributedType.DEEPSPEED and decoder.clip is not None:
+            # Then we need to make sure clip is using the correct precision or else deepspeed will error
+            cast_type_map = {
+                "fp16": torch.half,
+                "bf16": torch.bfloat16,
+                "no": torch.float
+            }
+            precision_type = cast_type_map[accelerator.mixed_precision]
+            assert precision_type == torch.float, "DeepSpeed currently only supports float32 precision when using on the fly embedding generation from clip"
+            clip = decoder.clip
+            clip.to(precision_type)
+
        decoder, *optimizers = list(self.accelerator.prepare(decoder, *optimizers))

        self.decoder = decoder

+        # prepare dataloaders
+
+        train_loader = val_loader = None
+        if exists(dataloaders):
+            train_loader, val_loader = self.accelerator.prepare(dataloaders["train"], dataloaders["val"])
+
+        self.train_loader = train_loader
+        self.val_loader = val_loader
+
        # store optimizers

        for opt_ind, optimizer in zip(range(len(optimizers)), optimizers):
@@ -635,8 +673,14 @@ class DecoderTrainer(nn.Module):
    def sample(self, *args, **kwargs):
        distributed = self.accelerator.num_processes > 1
        base_decoder = self.accelerator.unwrap_model(self.decoder)
+
+        was_training = base_decoder.training
+        base_decoder.eval()
+
        if kwargs.pop('use_non_ema', False) or not self.use_ema:
-            return base_decoder.sample(*args, **kwargs, distributed = distributed)
+            out = base_decoder.sample(*args, **kwargs, distributed = distributed)
+            base_decoder.train(was_training)
+            return out

        trainable_unets = self.accelerator.unwrap_model(self.decoder).unets
        base_decoder.unets = self.unets                  # swap in exponential moving averaged unets for sampling
@@ -649,6 +693,7 @@ class DecoderTrainer(nn.Module):
        for ema in self.ema_unets:
            ema.restore_ema_model_device()

+        base_decoder.train(was_training)
        return output

    @torch.no_grad()
@@ -675,6 +720,9 @@ class DecoderTrainer(nn.Module):

        total_loss = 0.

+        
+        using_amp = self.accelerator.mixed_precision != 'no'
+
        for chunk_size_frac, (chunked_args, chunked_kwargs) in split_args_and_kwargs(*args, split_size = max_batch_size, **kwargs):
            with self.accelerator.autocast():
                loss = self.decoder(*chunked_args, unet_number = unet_number, **chunked_kwargs)
--- a/dalle2_pytorch/version.py
+++ b/dalle2_pytorch/version.py
@@ -1 +1 @@
-__version__ = '0.17.1'
+__version__ = '0.26.0'
--- a/train_decoder.py
+++ b/train_decoder.py
@@ -274,6 +274,7 @@ def train(
    trainer = DecoderTrainer(
        decoder=decoder,
        accelerator=accelerator,
+        dataloaders=dataloaders,
        **kwargs
    )

@@ -284,7 +285,6 @@ def train(
    sample = 0
    samples_seen = 0
    val_sample = 0
-    step = lambda: int(trainer.num_steps_taken(unet_number=1))

    if tracker.can_recall:
        start_epoch, validation_losses, next_task, recalled_sample, samples_seen = recall_trainer(tracker, trainer)
@@ -299,6 +299,8 @@ def train(
    if not exists(unet_training_mask):
        # Then the unet mask should be true for all unets in the decoder
        unet_training_mask = [True] * trainer.num_unets
+    first_training_unet = min(index for index, mask in enumerate(unet_training_mask) if mask)
+    step = lambda: int(trainer.num_steps_taken(unet_number=first_training_unet+1))
    assert len(unet_training_mask) == trainer.num_unets, f"The unet training mask should be the same length as the number of unets in the decoder. Got {len(unet_training_mask)} and {trainer.num_unets}"

    accelerator.print(print_ribbon("Generating Example Data", repeat=40))
@@ -356,6 +358,7 @@ def train(
                        else:
                            # Then we need to pass the text instead
                            tokenized_texts = tokenize(txt, truncate=True)
+                            assert tokenized_texts.shape[0] == len(img), f"The number of texts ({tokenized_texts.shape[0]}) should be the same as the number of images ({len(img)})"
                            forward_params['text'] = tokenized_texts
                    loss = trainer.forward(img, **forward_params, unet_number=unet)
                    trainer.update(unet_number=unet)
@@ -414,7 +417,7 @@ def train(
            timer = Timer()
            accelerator.wait_for_everyone()
            i = 0
-            for i, (img, emb, txt) in enumerate(dataloaders["val"]):
+            for i, (img, emb, txt) in enumerate(dataloaders['val']):  # Use the accelerate prepared loader
                val_sample_length_tensor[0] = len(img)
                all_samples = accelerator.gather(val_sample_length_tensor)
                total_samples = all_samples.sum().item()
@@ -519,6 +522,20 @@ def initialize_training(config: TrainDecoderConfig, config_path):
    # Set up accelerator for configurable distributed training
    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=config.train.find_unused_parameters)
    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
+
+    if accelerator.num_processes > 1:
+        # We are using distributed training and want to immediately ensure all can connect
+        accelerator.print("Waiting for all processes to connect...")
+        accelerator.wait_for_everyone()
+        accelerator.print("All processes online and connected")
+
+    # If we are in deepspeed fp16 mode, we must ensure learned variance is off
+    if accelerator.mixed_precision == "fp16" and accelerator.distributed_type == accelerate_dataclasses.DistributedType.DEEPSPEED and config.decoder.learned_variance:
+        raise ValueError("DeepSpeed fp16 mode does not support learned variance")
+
+    if accelerator.process_index != accelerator.local_process_index and accelerator.distributed_type == accelerate_dataclasses.DistributedType.DEEPSPEED:
+        # This is an invalid configuration until we figure out how to handle this
+        raise ValueError("DeepSpeed does not support multi-node distributed training")
    
    # Set up data
    all_shards = list(range(config.data.start_shard, config.data.end_shard + 1))
@@ -541,7 +558,7 @@ def initialize_training(config: TrainDecoderConfig, config_path):

    # Create the decoder model and print basic info
    decoder = config.decoder.create()
-    num_parameters = sum(p.numel() for p in decoder.parameters())
+    get_num_parameters = lambda model, only_training=False: sum(p.numel() for p in model.parameters() if (p.requires_grad or not only_training))

    # Create and initialize the tracker if we are the master
    tracker = create_tracker(accelerator, config, config_path, dummy = rank!=0)
@@ -570,7 +587,10 @@ def initialize_training(config: TrainDecoderConfig, config_path):
    accelerator.print(print_ribbon("Loaded Config", repeat=40))
    accelerator.print(f"Running training with {accelerator.num_processes} processes and {accelerator.distributed_type} distributed training")
    accelerator.print(f"Training using {data_source_string}. {'conditioned on text' if conditioning_on_text else 'not conditioned on text'}")
-    accelerator.print(f"Number of parameters: {num_parameters}")
+    accelerator.print(f"Number of parameters: {get_num_parameters(decoder)} total; {get_num_parameters(decoder, only_training=True)} training")
+    for i, unet in enumerate(decoder.unets):
+        accelerator.print(f"Unet {i} has {get_num_parameters(unet)} total; {get_num_parameters(unet, only_training=True)} training")
+
    train(dataloaders, decoder, accelerator,
        tracker=tracker,
        inference_device=accelerator.device,
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -126,9 +126,9 @@ def report_cosine_sims(

        # we are text conditioned, we produce an embedding from the tokenized text
        if text_conditioned:
-            text_embedding, text_encodings, text_mask = trainer.embed_text(text_data)
+            text_embedding, text_encodings = trainer.embed_text(text_data)
            text_cond = dict(
-                text_embed=text_embedding, text_encodings=text_encodings, mask=text_mask
+                text_embed=text_embedding, text_encodings=text_encodings
            )
        else:
            text_embedding = text_data
@@ -146,15 +146,12 @@ def report_cosine_sims(

        if text_conditioned:
            text_encodings_shuffled = text_encodings[rolled_idx]
-            text_mask_shuffled = text_mask[rolled_idx]
        else:
            text_encodings_shuffled = None
-            text_mask_shuffled = None

        text_cond_shuffled = dict(
            text_embed=text_embed_shuffled,
-            text_encodings=text_encodings_shuffled,
-            mask=text_mask_shuffled,
+            text_encodings=text_encodings_shuffled
        )

        # prepare the text embedding
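
The `train_decoder.py` hunk above moves the `step` counter so that it reads from the first unet that is actually being trained, rather than always from unet 1. A minimal sketch of that index selection in plain Python follows. The helper name `first_training_unet_number` is hypothetical, and the explicit error for an all-`False` mask is an addition of this sketch: the one-liner in the diff would instead surface a bare `ValueError` from `min()` on an empty sequence.

```python
from typing import Sequence


def first_training_unet_number(unet_training_mask: Sequence[bool]) -> int:
    """Return the 1-indexed number of the first unet whose mask entry is True."""
    # Hypothetical helper mirroring
    # `min(index for index, mask in enumerate(unet_training_mask) if mask) + 1`
    for index, is_training in enumerate(unet_training_mask):
        if is_training:
            # The trainer calls in the diff use 1-indexed unet numbers.
            return index + 1
    # Assumption of this sketch: fail with a descriptive message instead of
    # the generic error that min() raises on an empty sequence.
    raise ValueError("unet_training_mask must enable at least one unet")


# Only the second and third unets are trained, so steps are read from unet 2.
print(first_training_unet_number([False, True, True]))
```

With a mask of `[True] * num_unets` (the default set in the same hunk) this reduces to the old behaviour of reading steps from unet 1.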
Author	SHA1	Message	Date
Phil Wang	723bf0abba	complete inpainting ability using inpaint_image and inpaint_mask passed into sample function for decoder	2022-07-19 09:26:55 -07:00
Phil Wang	d88c7ba56c	fix a bug with ddim and predict x0 objective	2022-07-18 19:04:26 -07:00
Phil Wang	3676a8ce78	comments	2022-07-18 15:02:04 -07:00
Phil Wang	da8e99ada0	fix sample bug	2022-07-18 13:50:22 -07:00
Phil Wang	6afb886cf4	complete imagen-like noise level conditioning	2022-07-18 13:43:57 -07:00
Phil Wang	c7fe4f2f44	project management	2022-07-17 17:27:44 -07:00
Phil Wang	a2ee3fa3cc	offer way to turn off initial cross embed convolutional module, for debugging upsampler artifacts	2022-07-15 17:29:10 -07:00
Phil Wang	a58a370d75	takes care of a grad strides error at https://github.com/lucidrains/DALLE2-pytorch/issues/196 thanks to @YUHANG-Ma	2022-07-14 15:28:34 -07:00
Phil Wang	1662bbf226	protect against random cropping for base unet	2022-07-14 12:49:43 -07:00
Phil Wang	5be1f57448	update	2022-07-14 12:03:42 -07:00
Phil Wang	c52ce58e10	update	2022-07-14 10:54:51 -07:00
Phil Wang	a34f60962a	let the neural network peek at the low resolution conditioning one last time before making prediction, for upsamplers	2022-07-14 10:27:04 -07:00
Phil Wang	0b40cbaa54	just always use nearest neighbor interpolation when resizing for low resolution conditioning, for https://github.com/lucidrains/DALLE2-pytorch/pull/181	2022-07-13 20:59:43 -07:00
Phil Wang	f141144a6d	allow for using classifier free guidance for some unets but not others, by passing in a tuple of cond_scale during sampling for decoder, just in case it is causing issues for upsamplers	2022-07-13 13:12:30 -07:00
Phil Wang	f988207718	hack around some inplace error, also make sure for openai clip text encoding, only tokens after eos_id is masked out	2022-07-13 12:56:02 -07:00
Phil Wang	b2073219f0	foolproof sampling for decoder to always use eval mode (and restore training state afterwards)	2022-07-13 10:21:00 -07:00
Phil Wang	cc0f7a935c	fix non pixel shuffle upsample	2022-07-13 10:16:02 -07:00
Phil Wang	95a512cb65	fix a potential bug with conditioning with blurred low resolution image, blur should be applied only 50% of the time	2022-07-13 10:11:49 -07:00
Phil Wang	972ee973bc	fix issue with ddim and normalization of lowres conditioning image	2022-07-13 09:48:40 -07:00
Phil Wang	79e2a3bc77	only use the stable layernorm for final output norm in transformer	2022-07-13 07:56:30 -07:00
Aidan Dempster	544cdd0b29	Reverted to using basic dataloaders (#205). Accelerate removes the ability to collate strings. Likely since it cannot gather strings.	2022-07-12 18:22:27 -07:00
Phil Wang	349aaca56f	add yet another transformer stability measure	2022-07-12 17:49:16 -07:00
Phil Wang	3ee3c56d2a	add learned padding tokens, same strategy as dalle1, for diffusion prior, and get rid of masking in causal transformer	2022-07-12 17:33:14 -07:00
Phil Wang	cd26c6b17d	0.22.3	2022-07-12 17:08:31 -07:00
Phil Wang	775abc4df6	add setting to attend to all text encodings regardless of padding, for diffusion prior	2022-07-12 17:08:12 -07:00
Phil Wang	11b1d533a0	make sure text encodings being passed in has the correct batch dimension	2022-07-12 16:00:19 -07:00
Phil Wang	e76e89f9eb	remove text masking altogether in favor of deriving from text encodings (padded text encodings must be pad value of 0.)	2022-07-12 15:40:31 -07:00
Phil Wang	bb3ff0ac67	protect against bad text mask being passed into decoder	2022-07-12 15:33:13 -07:00
Phil Wang	1ec4dbe64f	one more fix for text mask, if the length of the text encoding exceeds max_text_len, add an assert for better error msg	2022-07-12 15:01:46 -07:00
Phil Wang	e0835acca9	generate text mask within the unet and diffusion prior itself from the text encodings, if not given	2022-07-12 12:54:59 -07:00
Phil Wang	e055793e5d	shoutout for @MalumaDev	2022-07-11 16:12:35 -07:00
Phil Wang	1d9ef99288	add PixelShuffleUpsample thanks to @MalumaDev and @marunine for running the experiment and verifyng absence of checkboard artifacts	2022-07-11 16:07:23 -07:00
Phil Wang	bdd62c24b3	zero init final projection in unet, since openai and @crowsonkb are both doing it	2022-07-11 13:22:06 -07:00
Phil Wang	1f1557c614	make it so even if text mask is omitted, it will be derived based on whether text encodings are all 0s or not, simplify dataloading	2022-07-11 10:56:19 -07:00
Aidan Dempster	1a217e99e3	Unet parameter count is now shown (#202)	2022-07-10 16:45:59 -07:00
Phil Wang	7ea314e2f0	allow for final l2norm clamping of the sampled image embed	2022-07-10 09:44:38 -07:00
Phil Wang	4173e88121	more accurate readme	2022-07-09 20:57:26 -07:00
Phil Wang	3dae43fa0e	fix misnamed variable, thanks to @nousr	2022-07-09 19:01:37 -07:00
Phil Wang	a598820012	do not noise for the last step in ddim	2022-07-09 18:38:40 -07:00
Phil Wang	4878762627	fix for small validation bug for sampling steps	2022-07-09 17:31:54 -07:00
Phil Wang	47ae17b36e	more informative error for something that tripped me up	2022-07-09 17:28:14 -07:00
Phil Wang	b7e22f7da0	complete ddim integration of diffusion prior as well as decoder for each unet, feature complete for https://github.com/lucidrains/DALLE2-pytorch/issues/157	2022-07-09 17:25:34 -07:00
Romain Beaumont	68de937aac	Fix decoder test by fixing the resizing output size (#197)	2022-07-09 07:48:07 -07:00
Phil Wang	097afda606	0.18.0	2022-07-08 18:18:38 -07:00
Aidan Dempster	5c520db825	Added deepspeed support (#195)	2022-07-08 18:18:08 -07:00
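
Commit b2073219f0 in the list above makes `DecoderTrainer.sample` switch the decoder to eval mode and restore the previous training state afterwards. The same save-and-restore pattern can be sketched as a context manager. `TinyModule` and `eval_mode` are hypothetical names written for this illustration and are not part of the library; `TinyModule` only imitates the `training` / `train()` / `eval()` surface of `torch.nn.Module` so the sketch runs without PyTorch. The `try/finally` is an addition of this sketch, so the state is restored even if sampling raises.

```python
from contextlib import contextmanager


class TinyModule:
    """Hypothetical stand-in for the training-flag surface of torch.nn.Module."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


@contextmanager
def eval_mode(module):
    """Put `module` in eval mode for the duration of the block, then restore it."""
    was_training = module.training  # remember the state the caller had
    module.eval()
    try:
        yield module
    finally:
        # Runs on normal exit and on exceptions alike.
        module.train(was_training)


model = TinyModule()
with eval_mode(model):
    inside = model.training  # False while sampling would run
print(inside, model.training)  # prints: False True
```

The diff restores the flag with explicit `base_decoder.train(was_training)` calls on each return path; the context-manager form is one alternative that keeps the restore in a single place.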