Compare commits

...

7 Commits

| Author | SHA1 | Message | Date |
| ------ | ---- | ------- | ---- |
| Phil Wang | a0e41267f8 | just use an assert to make sure clip image channels is never different than the channels of the diffusion prior and decoder, if clip is given | 2022-05-22 22:34:33 -07:00 |
| Phil Wang | 276abf337b | fix and cleanup image size determination logic in decoder | 2022-05-22 22:28:45 -07:00 |
| Phil Wang | ae42d03006 | allow for saving of additional fields on save method in trainers, and return loaded objects from the load method | 2022-05-22 22:14:25 -07:00 |
| Phil Wang | 4d346e98d9 | allow for config driven creation of clip-less diffusion prior | 2022-05-22 20:36:20 -07:00 |
| Phil Wang | 2b1fd1ad2e | product management | 2022-05-22 19:23:40 -07:00 |
| zion | 82a2ef37d9 | Update README.md (#109): block in a section that links to available pre-trained models for those who are interested | 2022-05-22 19:22:30 -07:00 |
| Phil Wang | 5c397c9d66 | move neural network creations off the configuration file into the pydantic classes | 2022-05-22 19:18:18 -07:00 |
8 changed files with 125 additions and 54 deletions

View File

@@ -24,6 +24,11 @@ There was enough interest for a <a href="https://github.com/lucidrains/dalle2-ja
*ongoing at 21k steps*
## Pre-Trained Models
- LAION is training prior models. Checkpoints are available on <a href="https://huggingface.co/zenglishuci/conditioned-prior">🤗huggingface</a> and the training statistics are available on <a href="https://wandb.ai/nousr_laion/conditioned-prior/reports/LAION-DALLE2-PyTorch-Prior--VmlldzoyMDI2OTIx">🐝WANDB</a>.
- Decoder 🚧
- DALL-E 2 🚧
## Install
```bash
@@ -1079,6 +1084,7 @@ This library would not have gotten to this working state without the help of
- [x] use pydantic for config driven training
- [x] for both diffusion prior and decoder, all exponential moving averaged models need to be saved and restored as well (as well as the step number)
- [x] offer save / load methods on the trainer classes to automatically take care of state dicts for scalers / optimizers / saving versions and checking for breaking changes
- [x] allow for creation of diffusion prior model off pydantic config classes - consider the same for tracker configs
- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo) - consider https://github.com/lucidrains/uformer-pytorch attention-based unet
- [ ] transcribe code to Jax, which lowers the activation energy for distributed training, given access to TPUs
- [ ] train on a toy task, offer in colab
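
The checked-off item above on config driven creation of the diffusion prior corresponds to the new `DiffusionPriorConfig` pydantic class appearing later in this diff. Below is a minimal sketch of building a clip-less prior from it, assuming the class is importable from `dalle2_pytorch.train_configs` and with purely illustrative field values:

```python
# minimal sketch, assuming the pydantic classes in this diff live in
# dalle2_pytorch.train_configs; field values are illustrative only
from dalle2_pytorch.train_configs import DiffusionPriorConfig

prior_config = DiffusionPriorConfig(
    net = dict(             # coerced by pydantic into DiffusionPriorNetworkConfig
        dim = 512,
        depth = 6,
        dim_head = 64,
        heads = 8
    ),
    image_embed_dim = 512,  # clip-less, so the embedding dimension is given explicitly
    image_size = 256,
    image_channels = 3,
    timesteps = 1000
)

# create() builds the DiffusionPriorNetwork from the nested `net` config,
# then hands the remaining fields to DiffusionPrior
diffusion_prior = prior_config.create()
```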

View File

@@ -6,9 +6,10 @@ For more complex configuration, we provide the option of using a configuration f
The decoder trainer has 7 main configuration options. A full example of their use can be found in the [example decoder configuration](train_decoder_config.example.json).
**<ins>Unets</ins>:**
**<ins>Unet</ins>:**
This is a single unet config; these belong in an array nested under the decoder config as the `unets` list.
Each member of this array defines a single unet that will be added to the decoder.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `dim` | Yes | N/A | The starting channels of the unet. |
@@ -22,6 +23,7 @@ Any parameter from the `Unet` constructor can also be given here.
Defines the configuration options for the decoder model. The unets defined above will automatically be inserted.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `unets` | Yes | N/A | A list of unets, using the configuration above |
| `image_sizes` | Yes | N/A | The resolution of the image after each upsampling step. The length of this array should be the number of unets defined. |
| `image_size` | Yes | N/A | Not used. Can be any number. |
| `timesteps` | No | `1000` | The number of diffusion timesteps used for generation. |
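
To make the nesting concrete, here is a rough Python sketch of building a decoder this way, mirroring the example config file diffed below and assuming the `DecoderConfig` class from this change is importable from `dalle2_pytorch.train_configs`:

```python
# rough sketch; module path assumed, settings mirror the example config
# and are not recommendations
from dalle2_pytorch.train_configs import DecoderConfig

decoder_config = DecoderConfig(
    unets = [
        dict(
            dim = 128,
            image_embed_dim = 768,
            cond_dim = 64,
            channels = 3,
            dim_mults = [1, 2, 4, 8]
        )
    ],
    image_sizes = [64],   # one resolution per unet
    channels = 3,
    timesteps = 1000
)

# create() instantiates each Unet from its config, then passes the rest to Decoder
decoder = decoder_config.create()
```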

View File

@@ -1,16 +1,16 @@
{
"unets": [
{
"dim": 128,
"image_embed_dim": 768,
"cond_dim": 64,
"channels": 3,
"dim_mults": [1, 2, 4, 8],
"attn_dim_head": 32,
"attn_heads": 16
}
],
"decoder": {
"unets": [
{
"dim": 128,
"image_embed_dim": 768,
"cond_dim": 64,
"channels": 3,
"dim_mults": [1, 2, 4, 8],
"attn_dim_head": 32,
"attn_heads": 16
}
],
"image_sizes": [64],
"channels": 3,
"timesteps": 1000,

View File

@@ -890,6 +890,8 @@ class DiffusionPrior(BaseGaussianDiffusion):
)
if exists(clip):
assert image_channels == clip.image_channels, f'channels of image ({image_channels}) should be equal to the channels that CLIP accepts ({clip.image_channels})'
if isinstance(clip, CLIP):
clip = XClipAdapter(clip, **clip_adapter_overrides)
elif isinstance(clip, CoCa):
@@ -1710,12 +1712,19 @@ class Decoder(BaseGaussianDiffusion):
)
self.unconditional = unconditional
assert not (condition_on_text_encodings and unconditional), 'unconditional decoder image generation cannot be set to True if conditioning on text is present'
assert self.unconditional or (exists(clip) ^ exists(image_size)), 'either CLIP is supplied, or you must give the image_size and channels (usually 3 for RGB)'
# text conditioning
assert not (condition_on_text_encodings and unconditional), 'unconditional decoder image generation cannot be set to True if conditioning on text is present'
self.condition_on_text_encodings = condition_on_text_encodings
# clip
self.clip = None
if exists(clip):
assert not unconditional, 'clip must not be given if doing unconditional image training'
assert channels == clip.image_channels, f'channels of image ({channels}) should be equal to the channels that CLIP accepts ({clip.image_channels})'
if isinstance(clip, CLIP):
clip = XClipAdapter(clip, **clip_adapter_overrides)
elif isinstance(clip, CoCa):
@@ -1725,13 +1734,20 @@ class Decoder(BaseGaussianDiffusion):
assert isinstance(clip, BaseClipAdapter)
self.clip = clip
self.clip_image_size = clip.image_size
self.channels = clip.image_channels
else:
self.clip_image_size = image_size
self.channels = channels
self.condition_on_text_encodings = condition_on_text_encodings
# determine image size, with image_size and image_sizes taking precedence
if exists(image_size) or exists(image_sizes):
assert exists(image_size) ^ exists(image_sizes), 'only one of image_size or image_sizes must be given'
image_size = default(image_size, lambda: image_sizes[-1])
elif exists(clip):
image_size = clip.image_size
else:
raise ValueError('either image_size, image_sizes, or clip must be given to decoder')
# channels
self.channels = channels
# automatically take care of ensuring that first unet is unconditional
# while the rest of the unets are conditioned on the low resolution image produced by previous unet
@@ -1773,7 +1789,7 @@ class Decoder(BaseGaussianDiffusion):
# unet image sizes
image_sizes = default(image_sizes, (self.clip_image_size,))
image_sizes = default(image_sizes, (image_size,))
image_sizes = tuple(sorted(set(image_sizes)))
assert len(self.unets) == len(image_sizes), f'you did not supply the correct number of u-nets ({len(self.unets)}) for resolutions {image_sizes}'
@@ -1811,6 +1827,7 @@ class Decoder(BaseGaussianDiffusion):
self.clip_x_start = clip_x_start
# normalize and unnormalize image functions
self.normalize_img = normalize_neg_one_to_one if auto_normalize_img else identity
self.unnormalize_img = unnormalize_zero_to_one if auto_normalize_img else identity
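
In short, `image_size` / `image_sizes` now take precedence, CLIP's image size is only a fallback, and a decoder can be built without CLIP as long as a size is supplied. A hedged sketch of both paths (the Unet hyperparameters are arbitrary, not taken from this diff):

```python
# illustrative sketch of the new size/channel handling; Unet settings are arbitrary
from dalle2_pytorch import Unet, Decoder

unet = Unet(dim = 128, image_embed_dim = 512, channels = 3, dim_mults = (1, 2, 4))

# clip-less: image_sizes (or image_size) must be supplied explicitly
decoder = Decoder(unet = unet, image_sizes = (256,), channels = 3, timesteps = 1000)

# with clip: the image size falls back to clip.image_size, and channels must
# match clip.image_channels or the new assert fires
# decoder = Decoder(unet = unet, clip = clip, timesteps = 1000)
```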

View File

@@ -3,15 +3,61 @@ from torchvision import transforms as T
from pydantic import BaseModel, validator, root_validator
from typing import List, Iterable, Optional, Union, Tuple, Dict, Any
from dalle2_pytorch.dalle2_pytorch import Unet, Decoder, DiffusionPrior, DiffusionPriorNetwork
# helper functions
def exists(val):
return val is not None
def default(val, d):
return val if exists(val) else d
def ListOrTuple(inner_type):
return Union[List[inner_type], Tuple[inner_type]]
# pydantic classes
class DiffusionPriorNetworkConfig(BaseModel):
dim: int
depth: int
num_timesteps: int = None
num_time_embeds: int = 1
num_image_embeds: int = 1
num_text_embeds: int = 1
dim_head: int = 64
heads: int = 8
ff_mult: int = 4
norm_out: bool = True
attn_dropout: float = 0.
ff_dropout: float = 0.
final_proj: bool = True
normformer: bool = False
rotary_emb: bool = True
class DiffusionPriorConfig(BaseModel):
# only clip-less diffusion prior config for now
net: DiffusionPriorNetworkConfig
image_embed_dim: int
image_size: int
image_channels: int = 3
timesteps: int = 1000
cond_drop_prob: float = 0.
loss_type: str = 'l2'
predict_x_start: bool = True
beta_schedule: str = 'cosine'
def create(self):
kwargs = self.dict()
diffusion_prior_network = DiffusionPriorNetwork(**kwargs.pop('net'))
return DiffusionPrior(net = diffusion_prior_network, **kwargs)
class Config:
extra = "allow"
class UnetConfig(BaseModel):
dim: int
dim_mults: List[int]
dim_mults: ListOrTuple(int)
image_embed_dim: int = None
cond_dim: int = None
channels: int = 3
@@ -22,13 +68,22 @@ class UnetConfig(BaseModel):
extra = "allow"
class DecoderConfig(BaseModel):
unets: ListOrTuple(UnetConfig)
image_size: int = None
image_sizes: Union[List[int], Tuple[int]] = None
image_sizes: ListOrTuple(int) = None
channels: int = 3
timesteps: int = 1000
loss_type: str = 'l2'
beta_schedule: str = 'cosine'
learned_variance: bool = True
image_cond_drop_prob: float = 0.1
text_cond_drop_prob: float = 0.5
def create(self):
decoder_kwargs = self.dict()
unet_configs = decoder_kwargs.pop('unets')
unets = [Unet(**config) for config in unet_configs]
return Decoder(unets, **decoder_kwargs)
@validator('image_sizes')
def check_image_sizes(cls, image_sizes, values):
@@ -86,17 +141,17 @@ class DecoderTrainConfig(BaseModel):
wd: float = 0.01
max_grad_norm: float = 0.5
save_every_n_samples: int = 100000
n_sample_images: int = 6 # The number of example images to produce when sampling the train and test dataset
n_sample_images: int = 6 # The number of example images to produce when sampling the train and test dataset
device: str = 'cuda:0'
epoch_samples: int = None # Limits the number of samples per epoch. None means no limit. Required if resample_train is true as otherwise the number of samples per epoch is infinite.
validation_samples: int = None # Same as above but for validation.
epoch_samples: int = None # Limits the number of samples per epoch. None means no limit. Required if resample_train is true as otherwise the number of samples per epoch is infinite.
validation_samples: int = None # Same as above but for validation.
use_ema: bool = True
ema_beta: float = 0.99
amp: bool = False
save_all: bool = False # Whether to preserve all checkpoints
save_latest: bool = True # Whether to always save the latest checkpoint
save_best: bool = True # Whether to save the best checkpoint
unet_training_mask: List[bool] = None # If None, use all unets
save_all: bool = False # Whether to preserve all checkpoints
save_latest: bool = True # Whether to always save the latest checkpoint
save_best: bool = True # Whether to save the best checkpoint
unet_training_mask: ListOrTuple(bool) = None # If None, use all unets
class DecoderEvaluateConfig(BaseModel):
n_evaluation_samples: int = 1000
@@ -120,7 +175,6 @@ class DecoderLoadConfig(BaseModel):
resume: bool = False # If using wandb, whether to resume the run
class TrainDecoderConfig(BaseModel):
unets: List[UnetConfig]
decoder: DecoderConfig
data: DecoderDataConfig
train: DecoderTrainConfig
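
The `ListOrTuple` helper introduced at the top of this file simply wraps `Union[List[inner_type], Tuple[inner_type]]`, so fields such as `dim_mults`, `image_sizes`, and `unet_training_mask` validate whether they arrive as a list (e.g. from JSON) or as a tuple from Python code. A toy sketch of the behavior, assuming pydantic v1 as used here:

```python
# toy sketch of the ListOrTuple helper (pydantic v1 behavior assumed)
from typing import List, Tuple, Union
from pydantic import BaseModel

def ListOrTuple(inner_type):
    return Union[List[inner_type], Tuple[inner_type]]

class ToyUnetConfig(BaseModel):
    dim: int
    dim_mults: ListOrTuple(int)

# both spellings validate; pydantic coerces the tuple through the List[int] branch
print(ToyUnetConfig(dim = 128, dim_mults = [1, 2, 4, 8]).dim_mults)  # [1, 2, 4, 8]
print(ToyUnetConfig(dim = 128, dim_mults = (1, 2, 4, 8)).dim_mults)  # [1, 2, 4, 8]
```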

View File

@@ -288,7 +288,7 @@ class DiffusionPriorTrainer(nn.Module):
self.register_buffer('step', torch.tensor([0]))
def save(self, path, overwrite = True):
def save(self, path, overwrite = True, **kwargs):
path = Path(path)
assert not (path.exists() and not overwrite)
path.parent.mkdir(parents = True, exist_ok = True)
@@ -298,7 +298,8 @@ class DiffusionPriorTrainer(nn.Module):
optimizer = self.optimizer.state_dict(),
model = self.diffusion_prior.state_dict(),
version = get_pkg_version(),
step = self.step.item()
step = self.step.item(),
**kwargs
)
if self.use_ema:
@@ -319,7 +320,7 @@ class DiffusionPriorTrainer(nn.Module):
self.step.copy_(torch.ones_like(self.step) * loaded_obj['step'])
if only_model:
return
return loaded_obj
self.scaler.load_state_dict(loaded_obj['scaler'])
self.optimizer.load_state_dict(loaded_obj['optimizer'])
@@ -328,6 +329,8 @@ class DiffusionPriorTrainer(nn.Module):
assert 'ema' in loaded_obj
self.ema_diffusion_prior.load_state_dict(loaded_obj['ema'], strict = strict)
return loaded_obj
def update(self):
if exists(self.max_grad_norm):
self.scaler.unscale_(self.optimizer)
@@ -449,7 +452,7 @@ class DecoderTrainer(nn.Module):
self.register_buffer('step', torch.tensor([0.]))
def save(self, path, overwrite = True):
def save(self, path, overwrite = True, **kwargs):
path = Path(path)
assert not (path.exists() and not overwrite)
path.parent.mkdir(parents = True, exist_ok = True)
@@ -457,7 +460,8 @@ class DecoderTrainer(nn.Module):
save_obj = dict(
model = self.decoder.state_dict(),
version = get_pkg_version(),
step = self.step.item()
step = self.step.item(),
**kwargs
)
for ind in range(0, self.num_unets):
@@ -485,7 +489,7 @@ class DecoderTrainer(nn.Module):
self.step.copy_(torch.ones_like(self.step) * loaded_obj['step'])
if only_model:
return
return loaded_obj
for ind in range(0, self.num_unets):
scaler_key = f'scaler{ind}'
@@ -500,6 +504,8 @@ class DecoderTrainer(nn.Module):
assert 'ema' in loaded_obj
self.ema_unets.load_state_dict(loaded_obj['ema'], strict = strict)
return loaded_obj
@property
def unets(self):
return nn.ModuleList([ema.ema_model for ema in self.ema_unets])
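
In practice this means extra metadata can ride along in the checkpoint and be read back from the object returned by `load`. A minimal sketch, assuming a `DiffusionPriorTrainer` (or `DecoderTrainer`) has already been constructed as `trainer`; the `epoch` and `samples_seen` keys are made up for illustration:

```python
# minimal sketch; `trainer` is an already-constructed DiffusionPriorTrainer or
# DecoderTrainer, and the extra keys below are purely illustrative
trainer.save('./checkpoints/prior.pt', epoch = 5, samples_seen = 1_000_000)

loaded_obj = trainer.load('./checkpoints/prior.pt')     # now returns the full checkpoint dict
print(loaded_obj['epoch'], loaded_obj['samples_seen'])  # extra fields come back alongside the state dicts
```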

View File

@@ -10,7 +10,7 @@ setup(
'dream = dalle2_pytorch.cli:dream'
],
},
version = '0.4.7',
version = '0.4.12',
license='MIT',
description = 'DALL-E 2',
author = 'Phil Wang',

View File

@@ -85,20 +85,6 @@ def create_dataloaders(
"test_sampling": test_sampling_dataloader
}
def create_decoder(device, decoder_config, unets_config):
"""Creates a sample decoder"""
unets = [Unet(**config.dict()) for config in unets_config]
decoder = Decoder(
unet=unets,
**decoder_config.dict()
)
decoder.to(device=device)
return decoder
def get_dataset_keys(dataloader):
"""
It is sometimes necessary to get the keys the dataloader is returning. Since the dataset is buried in the dataloader, we need to do a process to recover it.
@@ -428,7 +414,7 @@ def initialize_training(config):
**config.data.dict()
)
decoder = create_decoder(device, config.decoder, config.unets)
decoder = config.decoder.create().to(device = device)
num_parameters = sum(p.numel() for p in decoder.parameters())
print(print_ribbon("Loaded Config", repeat=40))
print(f"Number of parameters: {num_parameters}")