make memory efficient unet design from imagen toggle-able

adopt similar unet architecture as imagen
add p2 loss reweighting for decoder training as an option
2026-02-12 11:34:29 +01:00 · 2022-06-15 13:40:26 -07:00 · 2022-06-15 12:18:21 -07:00 · 2022-06-14 10:58:57 -07:00 · 2022-06-13 21:01:50 -07:00 · 2022-06-07 17:31:38 -07:00
14 changed files with 261 additions and 101 deletions
--- a/README.md
+++ b/README.md
@@ -943,7 +943,7 @@ from dalle2_pytorch.dataloaders import ImageEmbeddingDataset, create_image_embed

 # Create a dataloader directly.
 dataloader = create_image_embedding_dataloader(
-    tar_url="/path/or/url/to/webdataset/{0000..9999}.tar", # Uses braket expanding notation. This specifies to read all tars from 0000.tar to 9999.tar
+    tar_url="/path/or/url/to/webdataset/{0000..9999}.tar", # Uses bracket expanding notation. This specifies to read all tars from 0000.tar to 9999.tar
    embeddings_url="path/or/url/to/embeddings/folder",     # Included if .npy files are not in webdataset. Left out or set to None otherwise
    num_workers=4,
    batch_size=32,
@@ -1097,7 +1097,7 @@ This library would not have gotten to this working state without the help of
 - [ ] test out grid attention in cascading ddpm locally, decide whether to keep or remove https://arxiv.org/abs/2204.01697
 - [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2
 - [ ] make sure FILIP works with DALL-E2 from x-clip https://arxiv.org/abs/2111.07783
- [ ] bring in skip-layer excitatons (from lightweight gan paper) to see if it helps for either decoder of unet or vqgan-vae training
+- [ ] bring in skip-layer excitations (from lightweight gan paper) to see if it helps for either decoder of unet or vqgan-vae training
 - [ ] decoder needs one day worth of refactor for tech debt
 - [ ] allow for unet to be able to condition non-cross attention style as well
 - [ ] read the paper, figure it out, and build it https://github.com/lucidrains/DALLE2-pytorch/issues/89
@@ -1207,4 +1207,14 @@ This library would not have gotten to this working state without the help of
 }
 ```

+```bibtex
+@article{Choi2022PerceptionPT,
+    title   = {Perception Prioritized Training of Diffusion Models},
+    author  = {Jooyoung Choi and Jungbeom Lee and Chaehun Shin and Sungwon Kim and Hyunwoo J. Kim and Sung-Hoon Yoon},
+    journal = {ArXiv},
+    year    = {2022},
+    volume  = {abs/2204.00227}
+}
+```
+
 *Creating noise from data is easy; creating data from noise is generative modeling.* - <a href="https://arxiv.org/abs/2011.13456">Yang Song's paper</a>
--- a/configs/README.md
+++ b/configs/README.md
@@ -83,7 +83,7 @@ Defines which evaluation metrics will be used to test the model.
 Each metric can be enabled by setting its configuration. The configuration keys for each metric are defined by the torchmetrics constructors which will be linked.
 | Option | Required | Default | Description |
 | ------ | -------- | ------- | ----------- |
-| `n_evalation_samples` | No | `1000` | The number of samples to generate to test the model. |
+| `n_evaluation_samples` | No | `1000` | The number of samples to generate to test the model. |
 | `FID` | No | `None` | Setting to an object enables the [Frechet Inception Distance](https://torchmetrics.readthedocs.io/en/stable/image/frechet_inception_distance.html) metric. 
 | `IS` | No | `None` | Setting to an object enables the [Inception Score](https://torchmetrics.readthedocs.io/en/stable/image/inception_score.html) metric.
 | `KID` | No | `None` | Setting to an object enables the [Kernel Inception Distance](https://torchmetrics.readthedocs.io/en/stable/image/kernel_inception_distance.html) metric. |
--- a/dalle2_pytorch/__init__.py
+++ b/dalle2_pytorch/__init__.py
@@ -1,3 +1,4 @@
+from dalle2_pytorch.version import __version__
 from dalle2_pytorch.dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder
 from dalle2_pytorch.dalle2_pytorch import OpenAIClipAdapter
 from dalle2_pytorch.trainer import DecoderTrainer, DiffusionPriorTrainer
--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
@@ -1,6 +1,6 @@
 import math
+import random
 from tqdm import tqdm
-from inspect import isfunction
 from functools import partial, wraps
 from contextlib import contextmanager
 from collections import namedtuple
@@ -11,7 +11,7 @@ import torch.nn.functional as F
 from torch import nn, einsum
 import torchvision.transforms as T

-from einops import rearrange, repeat
+from einops import rearrange, repeat, reduce
 from einops.layers.torch import Rearrange
 from einops_exts import rearrange_many, repeat_many, check_shape
 from einops_exts.torch import EinopsToAndFrom
@@ -56,7 +56,7 @@ def maybe(fn):
 def default(val, d):
    if exists(val):
        return val
-    return d() if isfunction(d) else d
+    return d() if callable(d) else d

 def cast_tuple(val, length = 1):
    if isinstance(val, list):
@@ -313,11 +313,6 @@ def extract(a, t, x_shape):
    out = a.gather(-1, t)
    return out.reshape(b, *((1,) * (len(x_shape) - 1)))

-def noise_like(shape, device, repeat=False):
-    repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1)))
-    noise = lambda: torch.randn(shape, device=device)
-    return repeat_noise() if repeat else noise()
-
 def meanflat(x):
    return x.mean(dim = tuple(range(1, len(x.shape))))

@@ -372,7 +367,7 @@ def quadratic_beta_schedule(timesteps):
    scale = 1000 / timesteps
    beta_start = scale * 0.0001
    beta_end = scale * 0.02
-    return torch.linspace(beta_start**2, beta_end**2, timesteps, dtype = torch.float64) ** 2
+    return torch.linspace(beta_start**0.5, beta_end**0.5, timesteps, dtype = torch.float64) ** 2


 def sigmoid_beta_schedule(timesteps):
@@ -384,7 +379,7 @@ def sigmoid_beta_schedule(timesteps):


 class BaseGaussianDiffusion(nn.Module):
-    def __init__(self, *, beta_schedule, timesteps, loss_type):
+    def __init__(self, *, beta_schedule, timesteps, loss_type, p2_loss_weight_gamma = 0., p2_loss_weight_k = 1):
        super().__init__()

        if beta_schedule == "cosine":
@@ -449,6 +444,11 @@ class BaseGaussianDiffusion(nn.Module):
        register_buffer('posterior_mean_coef1', betas * torch.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod))
        register_buffer('posterior_mean_coef2', (1. - alphas_cumprod_prev) * torch.sqrt(alphas) / (1. - alphas_cumprod))

+        # p2 loss reweighting
+
+        self.has_p2_loss_reweighting = p2_loss_weight_gamma > 0.
+        register_buffer('p2_loss_weight', (p2_loss_weight_k + alphas_cumprod / (1 - alphas_cumprod)) ** -p2_loss_weight_gamma)
+
    def q_posterior(self, x_start, x_t, t):
        posterior_mean = (
            extract(self.posterior_mean_coef1, t, x_t.shape) * x_start +
@@ -945,10 +945,10 @@ class DiffusionPrior(BaseGaussianDiffusion):
        return model_mean, posterior_variance, posterior_log_variance

    @torch.no_grad()
-    def p_sample(self, x, t, text_cond = None, clip_denoised = True, repeat_noise = False, cond_scale = 1.):
+    def p_sample(self, x, t, text_cond = None, clip_denoised = True, cond_scale = 1.):
        b, *_, device = *x.shape, x.device
        model_mean, _, model_log_variance = self.p_mean_variance(x = x, t = t, text_cond = text_cond, clip_denoised = clip_denoised, cond_scale = cond_scale)
-        noise = noise_like(x.shape, device, repeat_noise)
+        noise = torch.randn_like(x)
        # no noise when t == 0
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
@@ -1084,8 +1084,9 @@ class DiffusionPrior(BaseGaussianDiffusion):
 def Upsample(dim):
    return nn.ConvTranspose2d(dim, dim, 4, 2, 1)

-def Downsample(dim):
-    return nn.Conv2d(dim, dim, 4, 2, 1)
+def Downsample(dim, *, dim_out = None):
+    dim_out = default(dim_out, dim)
+    return nn.Conv2d(dim, dim_out, 4, 2, 1)

 class SinusoidalPosEmb(nn.Module):
    def __init__(self, dim):
@@ -1343,13 +1344,15 @@ class Unet(nn.Module):
        cond_on_text_encodings = False,
        max_text_len = 256,
        cond_on_image_embeds = False,
+        add_image_embeds_to_time = True, # alerted by @mhh0318 to a phrase in the paper - "Specifically, we modify the architecture described in Nichol et al. (2021) by projecting and adding CLIP embeddings to the existing timestep embedding"
        init_dim = None,
        init_conv_kernel_size = 7,
        resnet_groups = 8,
-        num_resnet_blocks = 1,
+        num_resnet_blocks = 2,
        init_cross_embed_kernel_sizes = (3, 7, 15),
        cross_embed_downsample = False,
        cross_embed_downsample_kernel_sizes = (2, 4),
+        memory_efficient = False,
        **kwargs
    ):
        super().__init__()
@@ -1369,7 +1372,7 @@ class Unet(nn.Module):
        self.channels_out = default(channels_out, channels)

        init_channels = channels if not lowres_cond else channels * 2 # in cascading diffusion, one concats the low resolution image, blurred, for conditioning the higher resolution synthesis
-        init_dim = default(init_dim, dim // 3 * 2)
+        init_dim = default(init_dim, dim)

        self.init_conv = CrossEmbedLayer(init_channels, dim_out = init_dim, kernel_sizes = init_cross_embed_kernel_sizes, stride = 1)

@@ -1396,11 +1399,16 @@ class Unet(nn.Module):
            nn.Linear(time_cond_dim, time_cond_dim)
        )

-        self.image_to_cond = nn.Sequential(
+        self.image_to_tokens = nn.Sequential(
            nn.Linear(image_embed_dim, cond_dim * num_image_tokens),
            Rearrange('b (n d) -> b n d', n = num_image_tokens)
        ) if cond_on_image_embeds and image_embed_dim != cond_dim else nn.Identity()

+        self.to_image_hiddens = nn.Sequential(
+            nn.Linear(image_embed_dim, time_cond_dim),
+            nn.GELU()
+        ) if cond_on_image_embeds and add_image_embeds_to_time else None
+
        self.norm_cond = nn.LayerNorm(cond_dim)
        self.norm_mid_cond = nn.LayerNorm(cond_dim)

@@ -1421,6 +1429,7 @@ class Unet(nn.Module):
        # for classifier free guidance

        self.null_image_embed = nn.Parameter(torch.randn(1, num_image_tokens, cond_dim))
+        self.null_image_hiddens = nn.Parameter(torch.randn(1, time_cond_dim))

        self.max_text_len = max_text_len
        self.null_text_embed = nn.Parameter(torch.randn(1, max_text_len, cond_dim))
@@ -1454,10 +1463,11 @@ class Unet(nn.Module):
            layer_cond_dim = cond_dim if not is_first else None

            self.downs.append(nn.ModuleList([
-                ResnetBlock(dim_in, dim_out, time_cond_dim = time_cond_dim, groups = groups),
+                downsample_klass(dim_in, dim_out = dim_out) if memory_efficient else None,
+                ResnetBlock(dim_out if memory_efficient else dim_in, dim_out, time_cond_dim = time_cond_dim, groups = groups),
                Residual(LinearAttention(dim_out, **attn_kwargs)) if sparse_attn else nn.Identity(),
                nn.ModuleList([ResnetBlock(dim_out, dim_out, cond_dim = layer_cond_dim, time_cond_dim = time_cond_dim, groups = groups) for _ in range(layer_num_resnet_blocks)]),
-                downsample_klass(dim_out) if not is_last else nn.Identity()
+                downsample_klass(dim_out) if not is_last and not memory_efficient else None
            ]))

        mid_dim = dims[-1]
@@ -1466,7 +1476,9 @@ class Unet(nn.Module):
        self.mid_attn = EinopsToAndFrom('b c h w', 'b (h w) c', Residual(Attention(mid_dim, **attn_kwargs))) if attend_at_middle else None
        self.mid_block2 = ResnetBlock(mid_dim, mid_dim, cond_dim = cond_dim, time_cond_dim = time_cond_dim, groups = resnet_groups[-1])

-        for ind, ((dim_in, dim_out), groups, layer_num_resnet_blocks) in enumerate(zip(reversed(in_out[1:]), reversed(resnet_groups), reversed(num_resnet_blocks))):
+        up_in_out_slice = slice(1 if not memory_efficient else None, None)
+
+        for ind, ((dim_in, dim_out), groups, layer_num_resnet_blocks) in enumerate(zip(reversed(in_out[up_in_out_slice]), reversed(resnet_groups), reversed(num_resnet_blocks))):
            is_last = ind >= (num_resolutions - 2)
            layer_cond_dim = cond_dim if not is_last else None

@@ -1563,7 +1575,23 @@ class Unet(nn.Module):
        image_keep_mask = prob_mask_like((batch_size,), 1 - image_cond_drop_prob, device = device)
        text_keep_mask = prob_mask_like((batch_size,), 1 - text_cond_drop_prob, device = device)

-        image_keep_mask, text_keep_mask = rearrange_many((image_keep_mask, text_keep_mask), 'b -> b 1 1')
+        text_keep_mask = rearrange(text_keep_mask, 'b -> b 1 1')
+
+        # image embedding to be summed to time embedding
+        # discovered by @mhh0318 in the paper
+
+        if exists(image_embed) and exists(self.to_image_hiddens):
+            image_hiddens = self.to_image_hiddens(image_embed)
+            image_keep_mask_hidden = rearrange(image_keep_mask, 'b -> b 1')
+            null_image_hiddens = self.null_image_hiddens.to(image_hiddens.dtype)
+
+            image_hiddens = torch.where(
+                image_keep_mask_hidden,
+                image_hiddens,
+                null_image_hiddens
+            )
+
+            t = t + image_hiddens

        # mask out image embedding depending on condition dropout
        # for classifier free guidance
@@ -1571,11 +1599,12 @@ class Unet(nn.Module):
        image_tokens = None

        if self.cond_on_image_embeds:
-            image_tokens = self.image_to_cond(image_embed)
+            image_keep_mask_embed = rearrange(image_keep_mask, 'b -> b 1 1')
+            image_tokens = self.image_to_tokens(image_embed)
            null_image_embed = self.null_image_embed.to(image_tokens.dtype) # for some reason pytorch AMP not working

            image_tokens = torch.where(
-                image_keep_mask,
+                image_keep_mask_embed,
                image_tokens,
                null_image_embed
            )
@@ -1630,7 +1659,10 @@ class Unet(nn.Module):

        hiddens = []

-        for init_block, sparse_attn, resnet_blocks, downsample in self.downs:
+        for pre_downsample, init_block, sparse_attn, resnet_blocks, post_downsample in self.downs:
+            if exists(pre_downsample):
+                x = pre_downsample(x)
+
            x = init_block(x, c, t)
            x = sparse_attn(x)

@@ -1638,7 +1670,9 @@ class Unet(nn.Module):
                x = resnet_block(x, c, t)

            hiddens.append(x)
-            x = downsample(x)
+
+            if exists(post_downsample):
+                x = post_downsample(x)

        x = self.mid_block1(x, mid_c, t)

@@ -1663,7 +1697,7 @@ class LowresConditioner(nn.Module):
    def __init__(
        self,
        downsample_first = True,
-        blur_sigma = 0.1,
+        blur_sigma = (0.1, 0.2),
        blur_kernel_size = 3,
    ):
        super().__init__()
@@ -1687,6 +1721,18 @@ class LowresConditioner(nn.Module):
            # when training, blur the low resolution conditional image
            blur_sigma = default(blur_sigma, self.blur_sigma)
            blur_kernel_size = default(blur_kernel_size, self.blur_kernel_size)
+
+            # allow for drawing a random sigma between lo and hi float values
+            if isinstance(blur_sigma, tuple):
+                blur_sigma = tuple(map(float, blur_sigma))
+                blur_sigma = random.uniform(*blur_sigma)
+
+            # allow for drawing a random kernel size between lo and hi int values
+            if isinstance(blur_kernel_size, tuple):
+                blur_kernel_size = tuple(map(int, blur_kernel_size))
+                kernel_size_lo, kernel_size_hi = blur_kernel_size
+                blur_kernel_size = random.randrange(kernel_size_lo, kernel_size_hi + 1)
+
            cond_fmap = gaussian_blur2d(cond_fmap, cast_tuple(blur_kernel_size, 2), cast_tuple(blur_sigma, 2))

        cond_fmap = resize_image_to(cond_fmap, target_image_size)
@@ -1712,23 +1758,28 @@ class Decoder(BaseGaussianDiffusion):
        image_sizes = None,                         # for cascading ddpm, image size at each stage
        random_crop_sizes = None,                   # whether to random crop the image at that stage in the cascade (super resoluting convolutions at the end may be able to generalize on smaller crops)
        lowres_downsample_first = True,             # cascading ddpm - resizes to lower resolution, then to next conditional resolution + blur
-        blur_sigma = 0.1,                           # cascading ddpm - blur sigma
+        blur_sigma = (0.1, 0.2),                    # cascading ddpm - blur sigma
        blur_kernel_size = 3,                       # cascading ddpm - blur kernel size
        condition_on_text_encodings = False,        # the paper suggested that this didn't do much in the decoder, but i'm allowing the option for experimentation
        clip_denoised = True,
        clip_x_start = True,
        clip_adapter_overrides = dict(),
        learned_variance = True,
+        learned_variance_constrain_frac = False,
        vb_loss_weight = 0.001,
        unconditional = False,
        auto_normalize_img = True,                  # whether to take care of normalizing the image from [0, 1] to [-1, 1] and back automatically - you can turn this off if you want to pass in the [-1, 1] ranged image yourself from the dataloader
        use_dynamic_thres = False,                  # from the Imagen paper
-        dynamic_thres_percentile = 0.9
+        dynamic_thres_percentile = 0.9,
+        p2_loss_weight_gamma = 0.,                  # p2 loss weight, from https://arxiv.org/abs/2204.00227 - 0 is equivalent to weight of 1 across time - 1. is recommended
+        p2_loss_weight_k = 1
    ):
        super().__init__(
            beta_schedule = beta_schedule,
            timesteps = timesteps,
-            loss_type = loss_type
+            loss_type = loss_type,
+            p2_loss_weight_gamma = p2_loss_weight_gamma,
+            p2_loss_weight_k = p2_loss_weight_k
        )

        self.unconditional = unconditional
@@ -1779,6 +1830,7 @@ class Decoder(BaseGaussianDiffusion):

        learned_variance = pad_tuple_to_length(cast_tuple(learned_variance), len(unets), fillvalue = False)
        self.learned_variance = learned_variance
+        self.learned_variance_constrain_frac = learned_variance_constrain_frac # whether to constrain the output of the network (the interpolation fraction) from 0 to 1
        self.vb_loss_weight = vb_loss_weight

        # construct unets and vaes
@@ -1919,16 +1971,19 @@ class Decoder(BaseGaussianDiffusion):
            max_log = extract(torch.log(self.betas), t, x.shape)
            var_interp_frac = unnormalize_zero_to_one(var_interp_frac_unnormalized)

+            if self.learned_variance_constrain_frac:
+                var_interp_frac = var_interp_frac.sigmoid()
+
            posterior_log_variance = var_interp_frac * max_log + (1 - var_interp_frac) * min_log
            posterior_variance = posterior_log_variance.exp()

        return model_mean, posterior_variance, posterior_log_variance

    @torch.no_grad()
-    def p_sample(self, unet, x, t, image_embed, text_encodings = None, text_mask = None, cond_scale = 1., lowres_cond_img = None, predict_x_start = False, learned_variance = False, clip_denoised = True, repeat_noise = False):
+    def p_sample(self, unet, x, t, image_embed, text_encodings = None, text_mask = None, cond_scale = 1., lowres_cond_img = None, predict_x_start = False, learned_variance = False, clip_denoised = True):
        b, *_, device = *x.shape, x.device
        model_mean, _, model_log_variance = self.p_mean_variance(unet, x = x, t = t, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, cond_scale = cond_scale, lowres_cond_img = lowres_cond_img, clip_denoised = clip_denoised, predict_x_start = predict_x_start, learned_variance = learned_variance)
-        noise = noise_like(x.shape, device, repeat_noise)
+        noise = torch.randn_like(x)
        # no noise when t == 0
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
@@ -1992,7 +2047,13 @@ class Decoder(BaseGaussianDiffusion):

        target = noise if not predict_x_start else x_start

-        loss = self.loss_fn(pred, target)
+        loss = self.loss_fn(pred, target, reduction = 'none')
+        loss = reduce(loss, 'b ... -> b (...)', 'mean')
+
+        if self.has_p2_loss_reweighting:
+            loss = loss * extract(self.p2_loss_weight, times, loss.shape)
+
+        loss = loss.mean()

        if not learned_variance:
            # return simple loss if not using learned variance
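
The P2 reweighting added above scales each sample's loss by `(k + SNR(t)) ** -gamma`, with `SNR(t) = alphas_cumprod[t] / (1 - alphas_cumprod[t])`. The following is a minimal sketch of that arithmetic in plain Python rather than torch, so it runs without dependencies. The helper names (`linear_betas`, `cumulative_alphas`, `p2_weights`) and the linear schedule are illustrative assumptions, not part of the library.

```python
def linear_betas(timesteps, beta_start=1e-4, beta_end=2e-2):
    # evenly spaced betas, a stand-in for the library's beta schedules (assumption)
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]

def cumulative_alphas(betas):
    # alphas_cumprod[t] = product over s <= t of (1 - beta_s)
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def p2_weights(alphas_cumprod, gamma=1.0, k=1.0):
    # (k + SNR) ** -gamma, where SNR = alphas_cumprod / (1 - alphas_cumprod)
    return [(k + a / (1.0 - a)) ** -gamma for a in alphas_cumprod]

alphas_cumprod = cumulative_alphas(linear_betas(1000))
weights = p2_weights(alphas_cumprod, gamma=1.0, k=1.0)

# per-sample losses (already averaged over non-batch dims) and their sampled timesteps
per_sample_loss = [0.5, 0.5, 0.5]
times = [0, 500, 999]

# weight each sample by its timestep, then average over the batch
weighted = [loss * weights[t] for loss, t in zip(per_sample_loss, times)]
batch_loss = sum(weighted) / len(weighted)
```

With `gamma = 0` every weight is 1, which matches the unweighted loss. With `k = 1` and `gamma = 1` the weight simplifies to `1 - alphas_cumprod[t]`, so low-noise timesteps contribute much less to the loss than high-noise ones.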
--- a/dalle2_pytorch/dataloaders/README.md
+++ b/dalle2_pytorch/dataloaders/README.md
@@ -15,7 +15,7 @@ from dalle2_pytorch.dataloaders import ImageEmbeddingDataset, create_image_embed

 # Create a dataloader directly.
 dataloader = create_image_embedding_dataloader(
-    tar_url="/path/or/url/to/webdataset/{0000..9999}.tar", # Uses braket expanding notation. This specifies to read all tars from 0000.tar to 9999.tar
+    tar_url="/path/or/url/to/webdataset/{0000..9999}.tar", # Uses bracket expanding notation. This specifies to read all tars from 0000.tar to 9999.tar
    embeddings_url="path/or/url/to/embeddings/folder",     # Included if .npy files are not in webdataset. Left out or set to None otherwise
    num_workers=4,
    batch_size=32,
--- a/dalle2_pytorch/optimizer.py
+++ b/dalle2_pytorch/optimizer.py
@@ -1,15 +1,17 @@
 from torch.optim import AdamW, Adam

 def separate_weight_decayable_params(params):
-    no_wd_params = set([param for param in params if param.ndim < 2])
-    wd_params = set(params) - no_wd_params
+    wd_params, no_wd_params = [], []
+    for param in params:
+        param_list = no_wd_params if param.ndim < 2 else wd_params
+        param_list.append(param)
    return wd_params, no_wd_params

 def get_optimizer(
    params,
    lr = 1e-4,
    wd = 1e-2,
-    betas = (0.9, 0.999),
+    betas = (0.9, 0.99),
    eps = 1e-8,
    filter_by_requires_grad = False,
    group_wd_params = True,
@@ -25,8 +27,8 @@ def get_optimizer(
        wd_params, no_wd_params = separate_weight_decayable_params(params)

        params = [
-            {'params': list(wd_params)},
-            {'params': list(no_wd_params), 'weight_decay': 0},
+            {'params': wd_params},
+            {'params': no_wd_params, 'weight_decay': 0},
        ]

    return AdamW(params, lr = lr, weight_decay = wd, betas = betas, eps = eps)
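
The rewritten `separate_weight_decayable_params` builds two lists in a single pass instead of two sets, which presumably keeps the parameter order stable between runs. Here is a small sketch of the split and the resulting parameter groups. `FakeParam` is a hypothetical stand-in for `torch.nn.Parameter` so the example runs without torch; only its `ndim` attribute matters.

```python
from dataclasses import dataclass

@dataclass(eq=False)
class FakeParam:
    # hypothetical stand-in for torch.nn.Parameter (assumption for this sketch)
    name: str
    ndim: int

def separate_weight_decayable_params(params):
    wd_params, no_wd_params = [], []
    for param in params:
        # tensors with fewer than 2 dims (biases, norm gains) skip weight decay
        param_list = no_wd_params if param.ndim < 2 else wd_params
        param_list.append(param)
    return wd_params, no_wd_params

params = [
    FakeParam('linear.weight', 2),
    FakeParam('linear.bias', 1),
    FakeParam('conv.weight', 4),
    FakeParam('norm.gamma', 1),
]

wd_params, no_wd_params = separate_weight_decayable_params(params)

# same group layout that the diff passes to AdamW
param_groups = [
    {'params': wd_params},
    {'params': no_wd_params, 'weight_decay': 0},
]
```

Because both groups are already lists, the `list(...)` conversions in `get_optimizer` are no longer needed.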
--- a/dalle2_pytorch/tokenizer.py
+++ b/dalle2_pytorch/tokenizer.py
@@ -2,7 +2,6 @@
 # to give users a quick easy start to training DALL-E without doing BPE

 import torch
-import youtokentome as yttm

 import html
 import os
@@ -11,6 +10,8 @@ import regex as re
 from functools import lru_cache
 from pathlib import Path

+from dalle2_pytorch.utils import import_or_print_error
+
 # OpenAI simple tokenizer

@lru_cache()
@@ -156,7 +157,9 @@ class YttmTokenizer:
        bpe_path = Path(bpe_path)
        assert bpe_path.exists(), f'BPE json path {str(bpe_path)} does not exist'

-        tokenizer = yttm.BPE(model = str(bpe_path))
+        self.yttm = import_or_print_error('youtokentome', 'you need to install youtokentome by `pip install youtokentome`')
+
+        tokenizer = self.yttm.BPE(model = str(bpe_path))
        self.tokenizer = tokenizer
        self.vocab_size = tokenizer.vocab_size()

@@ -167,7 +170,7 @@ class YttmTokenizer:
        return self.tokenizer.decode(tokens, ignore_ids = pad_tokens.union({0}))

    def encode(self, texts):
-        encoded = self.tokenizer.encode(texts, output_type = yttm.OutputType.ID)
+        encoded = self.tokenizer.encode(texts, output_type = self.yttm.OutputType.ID)
        return list(map(torch.tensor, encoded))

    def tokenize(self, texts, context_length = 256, truncate_text = False):
--- a/dalle2_pytorch/trackers.py
+++ b/dalle2_pytorch/trackers.py
@@ -6,6 +6,8 @@ from itertools import zip_longest
 import torch
 from torch import nn

+from dalle2_pytorch.utils import import_or_print_error
+
 # constants

 DEFAULT_DATA_PATH = './.tracker-data'
@@ -15,14 +17,6 @@ DEFAULT_DATA_PATH = './.tracker-data'
 def exists(val):
    return val is not None

-def import_or_print_error(pkg_name, err_str = None):
-    try:
-        return importlib.import_module(pkg_name)
-    except ModuleNotFoundError as e:
-        if exists(err_str):
-            print(err_str)
-        exit()
-
 # load state dict functions

 def load_wandb_state_dict(run_path, file_path, **kwargs):
--- a/dalle2_pytorch/trainer.py
+++ b/dalle2_pytorch/trainer.py
@@ -11,6 +11,8 @@ from torch.cuda.amp import autocast, GradScaler

 from dalle2_pytorch.dalle2_pytorch import Decoder, DiffusionPrior
 from dalle2_pytorch.optimizer import get_optimizer
+from dalle2_pytorch.version import __version__
+from packaging import version

 import numpy as np

@@ -56,9 +58,15 @@ def num_to_groups(num, divisor):
        arr.append(remainder)
    return arr

-def get_pkg_version():
-    from pkg_resources import get_distribution
-    return get_distribution('dalle2_pytorch').version
+def clamp(value, min_value = None, max_value = None):
+    assert exists(min_value) or exists(max_value)
+    if exists(min_value):
+        value = max(value, min_value)
+
+    if exists(max_value):
+        value = min(value, max_value)
+
+    return value

 # decorators

@@ -174,12 +182,34 @@ def save_diffusion_model(save_path, model, optimizer, scaler, config, image_embe
 # exponential moving average wrapper

 class EMA(nn.Module):
+    """
+    Implements exponential moving average shadowing for your model.
+
+    Utilizes an inverse decay schedule to manage longer term training runs.
+    By adjusting the power, you can control how fast EMA will ramp up to your specified beta.
+
+    @crowsonkb's notes on EMA Warmup:
+    
+    If gamma=1 and power=1, implements a simple average. gamma=1, power=2/3 are
+    good values for models you plan to train for a million or more steps (reaches decay
+    factor 0.999 at 31.6K steps, 0.9999 at 1M steps), gamma=1, power=3/4 for models
+    you plan to train for less (reaches decay factor 0.999 at 10K steps, 0.9999 at
+    215.4k steps).
+    
+    Args:
+        inv_gamma (float): Inverse multiplicative factor of EMA warmup. Default: 1.
+        power (float): Exponential factor of EMA warmup. Default: 2/3.
+        min_value (float): The minimum EMA decay rate. Default: 0.
+    """
    def __init__(
        self,
        model,
        beta = 0.9999,
-        update_after_step = 1000,
+        update_after_step = 10000,
        update_every = 10,
+        inv_gamma = 1.0,
+        power = 2/3,
+        min_value = 0.0,
    ):
        super().__init__()
        self.beta = beta
@@ -187,7 +217,11 @@ class EMA(nn.Module):
        self.ema_model = copy.deepcopy(model)

        self.update_every = update_every
-        self.update_after_step = update_after_step  // update_every # only start EMA after this step number, starting at 0
+        self.update_after_step = update_after_step
+
+        self.inv_gamma = inv_gamma
+        self.power = power
+        self.min_value = min_value

        self.register_buffer('initted', torch.Tensor([False]))
        self.register_buffer('step', torch.tensor([0]))
@@ -197,37 +231,51 @@ class EMA(nn.Module):
        self.ema_model.to(device)

    def copy_params_from_model_to_ema(self):
-        self.ema_model.state_dict(self.online_model.state_dict())
+        for ma_param, current_param in zip(list(self.ema_model.parameters()), list(self.online_model.parameters())):
+            ma_param.data.copy_(current_param.data)
+
+        for ma_buffer, current_buffer in zip(list(self.ema_model.buffers()), list(self.online_model.buffers())):
+            ma_buffer.data.copy_(current_buffer.data)
+
+    def get_current_decay(self):
+        epoch = clamp(self.step.item() - self.update_after_step - 1, min_value = 0)
+        value = 1 - (1 + epoch / self.inv_gamma) ** - self.power
+
+        if epoch <= 0:
+            return 0.
+
+        return clamp(value, min_value = self.min_value, max_value = self.beta)

    def update(self):
+        step = self.step.item()
        self.step += 1

-        if (self.step % self.update_every) != 0:
+        if (step % self.update_every) != 0:
            return

-        if self.step <= self.update_after_step:
+        if step <= self.update_after_step:
            self.copy_params_from_model_to_ema()
            return

-        if not self.initted:
+        if not self.initted.item():
            self.copy_params_from_model_to_ema()
            self.initted.data.copy_(torch.Tensor([True]))

        self.update_moving_average(self.ema_model, self.online_model)

+    @torch.no_grad()
    def update_moving_average(self, ma_model, current_model):
-        def calculate_ema(beta, old, new):
-            if not exists(old):
-                return new
-            return old * beta + (1 - beta) * new
+        current_decay = self.get_current_decay()

-        for current_params, ma_params in zip(current_model.parameters(), ma_model.parameters()):
-            old_weight, up_weight = ma_params.data, current_params.data
-            ma_params.data = calculate_ema(self.beta, old_weight, up_weight)
+        for current_params, ma_params in zip(list(current_model.parameters()), list(ma_model.parameters())):
+            difference = ma_params.data - current_params.data
+            difference.mul_(1.0 - current_decay)
+            ma_params.sub_(difference)

-        for current_buffer, ma_buffer in zip(current_model.buffers(), ma_model.buffers()):
-            new_buffer_value = calculate_ema(self.beta, ma_buffer, current_buffer)
-            ma_buffer.copy_(new_buffer_value)
+        for current_buffer, ma_buffer in zip(list(current_model.buffers()), list(ma_model.buffers())):
+            difference = ma_buffer - current_buffer
+            difference.mul_(1.0 - current_decay)
+            ma_buffer.sub_(difference)

    def __call__(self, *args, **kwargs):
        return self.ema_model(*args, **kwargs)
@@ -299,7 +347,7 @@ class DiffusionPriorTrainer(nn.Module):
            scaler = self.scaler.state_dict(),
            optimizer = self.optimizer.state_dict(),
            model = self.diffusion_prior.state_dict(),
-            version = get_pkg_version(),
+            version = __version__,
            step = self.step.item(),
            **kwargs
        )
@@ -315,8 +363,8 @@ class DiffusionPriorTrainer(nn.Module):

        loaded_obj = torch.load(str(path))

-        if get_pkg_version() != loaded_obj['version']:
-            print(f'loading saved diffusion prior at version {loaded_obj["version"]} but current package version is at {get_pkg_version()}')
+        if version.parse(__version__) != version.parse(loaded_obj['version']):
+            print(f'loading saved diffusion prior at version {loaded_obj["version"]} but current package version is at {__version__}')

        self.diffusion_prior.load_state_dict(loaded_obj['model'], strict = strict)
        self.step.copy_(torch.ones_like(self.step) * loaded_obj['step'])
@@ -463,7 +511,7 @@ class DecoderTrainer(nn.Module):

        save_obj = dict(
            model = self.decoder.state_dict(),
-            version = get_pkg_version(),
+            version = __version__,
            step = self.step.item(),
            **kwargs
        )
@@ -486,8 +534,8 @@ class DecoderTrainer(nn.Module):

        loaded_obj = torch.load(str(path))

-        if get_pkg_version() != loaded_obj['version']:
-            print(f'loading saved decoder at version {loaded_obj["version"]}, but current package version is {get_pkg_version()}')
+        if version.parse(__version__) != version.parse(loaded_obj['version']):
+            print(f'loading saved decoder at version {loaded_obj["version"]}, but current package version is {__version__}')

        self.decoder.load_state_dict(loaded_obj['model'], strict = strict)
        self.step.copy_(torch.ones_like(self.step) * loaded_obj['step'])
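The three in-place buffer operations at the top of this trainer.py diff amount to the usual exponential moving average step. The following is a minimal sketch with plain Python floats, assuming scalar values; the real code works on torch tensors in place, and `ema_update` and `ema_textbook` are hypothetical names used only for illustration. It checks that subtracting `(1 - decay) * (ma - current)` matches `decay * ma + (1 - decay) * current`.

```python
import math

def ema_update(ma_value, current_value, decay):
    # Same order of operations as the diff:
    # difference = ma - current; difference *= (1 - decay); ma -= difference
    difference = ma_value - current_value
    difference *= 1.0 - decay
    return ma_value - difference

def ema_textbook(ma_value, current_value, decay):
    # Conventional form of the same update, for comparison
    return decay * ma_value + (1.0 - decay) * current_value

ma, current, decay = 1.0, 0.0, 0.9
assert math.isclose(ema_update(ma, current, decay), ema_textbook(ma, current, decay))
```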
--- a/dalle2_pytorch/utils.py
+++ b/dalle2_pytorch/utils.py
@@ -17,3 +17,13 @@ class Timer:
 def print_ribbon(s, symbol = '=', repeat = 40):
    flank = symbol * repeat
    return f'{flank} {s} {flank}'
+
+# import helpers
+
+def import_or_print_error(pkg_name, err_str = None):
+    try:
+        return importlib.import_module(pkg_name)
+    except ModuleNotFoundError as e:
+        if exists(err_str):
+            print(err_str)
+        exit()
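A self-contained sketch of how the helper added to utils.py might be used. Two assumptions: `exists` is defined here as a `None` check because the hunk does not show where it comes from, and `sys.exit(1)` replaces the bare `exit()` so the sketch does not rely on the `site` module. The package name in the failure case is made up.

```python
import importlib
import sys

def exists(val):
    # Assumed definition; the hunk above does not show it
    return val is not None

def import_or_print_error(pkg_name, err_str=None):
    try:
        return importlib.import_module(pkg_name)
    except ModuleNotFoundError:
        if exists(err_str):
            print(err_str)
        # Non-zero exit status signals the failure to the calling shell
        sys.exit(1)

# Succeeds: json is part of the standard library
json_module = import_or_print_error('json', 'json is required')

# Fails: the made-up package is missing, so the message is printed and SystemExit is raised
try:
    import_or_print_error('a_package_that_is_not_installed_xyz', 'optional package missing')
    exited = False
except SystemExit:
    exited = True
```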
--- a/dalle2_pytorch/version.py
+++ b/dalle2_pytorch/version.py
@@ -0,0 +1 @@
+__version__ = '0.8.1'
--- a/setup.py
+++ b/setup.py
@@ -1,4 +1,5 @@
 from setuptools import setup, find_packages
+exec(open('dalle2_pytorch/version.py').read())

 setup(
  name = 'dalle2-pytorch',
@@ -10,7 +11,7 @@ setup(
      'dream = dalle2_pytorch.cli:dream'
    ],
  },
-  version = '0.5.7',
+  version = __version__,
  license='MIT',
  description = 'DALL-E 2',
  author = 'Phil Wang',
@@ -31,6 +32,7 @@ setup(
    'embedding-reader',
    'kornia>=0.5.4',
    'numpy',
+    'packaging',
    'pillow',
    'pydantic',
    'resize-right>=0.0.2',
@@ -40,7 +42,6 @@ setup(
    'tqdm',
    'vector-quantize-pytorch',
    'x-clip>=0.4.4',
-    'youtokentome',
    'webdataset>=0.2.5',
    'fsspec>=2022.1.0',
    'torchmetrics[image]>=0.8.0'
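The setup.py change single-sources the version by executing version.py. Below is a minimal sketch of the same idea with an explicit namespace dict and a closed file handle. This is a variation, not what the diff does; `read_version` is a hypothetical helper and the temporary file stands in for `dalle2_pytorch/version.py`.

```python
import os
import tempfile

def read_version(path):
    # Execute the version file in its own namespace instead of the caller's globals
    namespace = {}
    with open(path) as f:
        exec(f.read(), namespace)
    return namespace['__version__']

with tempfile.TemporaryDirectory() as tmp:
    version_file = os.path.join(tmp, 'version.py')
    with open(version_file, 'w') as f:
        f.write("__version__ = '0.8.1'\n")
    demo_version = read_version(version_file)
```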
--- a/train_decoder.py
+++ b/train_decoder.py
@@ -4,6 +4,7 @@ from dalle2_pytorch.dataloaders import create_image_embedding_dataloader
 from dalle2_pytorch.trackers import WandbTracker, ConsoleTracker
 from dalle2_pytorch.train_configs import TrainDecoderConfig
 from dalle2_pytorch.utils import Timer, print_ribbon
+from dalle2_pytorch.dalle2_pytorch import resize_image_to

 import torchvision
 import torch
@@ -136,6 +137,14 @@ def generate_grid_samples(trainer, examples, text_prepend=""):
    Generates samples and uses torchvision to put them in a side by side grid for easy viewing
    """
    real_images, generated_images, captions = generate_samples(trainer, examples, text_prepend)
+
+    real_image_size = real_images[0].shape[-1]
+    generated_image_size = generated_images[0].shape[-1]
+
+    # training images may be larger than the generated one
+    if real_image_size > generated_image_size:
+        real_images = [resize_image_to(image, generated_image_size) for image in real_images]
+
    grid_images = [torchvision.utils.make_grid([original_image, generated_image]) for original_image, generated_image in zip(real_images, generated_images)]
    return grid_images, captions
                    
@@ -202,7 +211,7 @@ def recall_trainer(tracker, trainer, recall_source=None, **load_config):
    Loads the model with an appropriate method depending on the tracker
    """
    print(print_ribbon(f"Loading model from {recall_source}"))
-    state_dict = tracker.recall_state_dict(recall_source, **load_config)
+    state_dict = tracker.recall_state_dict(recall_source, **load_config.dict())
    trainer.load_state_dict(state_dict["trainer"])
    print("Model loaded")
    return state_dict["epoch"], state_dict["step"], state_dict["validation_losses"]
@@ -322,7 +331,7 @@ def train(
            sample = 0
            average_loss = 0
            timer = Timer()
-            for i, (img, emb, txt) in enumerate(dataloaders["val"]):
+            for i, (img, emb, *_) in enumerate(dataloaders["val"]):
                sample += img.shape[0]
                img, emb = send_to_device((img, emb))
                
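The `(img, emb, *_)` pattern in the validation loop accepts any datum with at least two fields and discards the rest. A small sketch with made-up placeholder tuples instead of real tensors:

```python
# Placeholder batches of different lengths; real batches hold tensors
batches = [
    ('img0', 'emb0'),
    ('img1', 'emb1', 'caption1'),
    ('img2', 'emb2', 'caption2', {'key': 'extra'}),
]

pairs = []
for i, (img, emb, *_) in enumerate(batches):
    # *_ collects zero or more trailing fields, so every tuple above unpacks cleanly
    pairs.append((i, img, emb))
```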
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -7,15 +7,13 @@ import torch
 import clip
 from torch import nn

-from dalle2_pytorch.dataloaders import make_splits
+from dalle2_pytorch.dataloaders import make_splits, get_reader
 from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork, OpenAIClipAdapter
 from dalle2_pytorch.trainer import DiffusionPriorTrainer, load_diffusion_model, save_diffusion_model

 from dalle2_pytorch.trackers import ConsoleTracker, WandbTracker
 from dalle2_pytorch.utils import Timer, print_ribbon

-from embedding_reader import EmbeddingReader
-
 from tqdm import tqdm

 # constants
@@ -31,7 +29,7 @@ def exists(val):

 # functions

-def eval_model(model, dataloader, text_conditioned, loss_type, phase="Validation"):
+def eval_model(model, dataloader, text_conditioned, loss_type, device, phase="Validation"):
    model.eval()

    with torch.no_grad():
@@ -39,6 +37,8 @@ def eval_model(model, dataloader, text_conditioned, loss_type, phase="Validation
        total_samples = 0.

        for image_embeddings, text_data in tqdm(dataloader):
+            image_embeddings = image_embeddings.to(device)
+            text_data = text_data.to(device)

            batches = image_embeddings.shape[0]

@@ -57,12 +57,14 @@ def eval_model(model, dataloader, text_conditioned, loss_type, phase="Validation

        tracker.log({f'{phase} {loss_type}': avg_loss})

-def report_cosine_sims(diffusion_prior, dataloader, text_conditioned):
+def report_cosine_sims(diffusion_prior, dataloader, text_conditioned, device):
    diffusion_prior.eval()

    cos = nn.CosineSimilarity(dim=1, eps=1e-6)

    for test_image_embeddings, text_data in tqdm(dataloader):
+        test_image_embeddings = test_image_embeddings.to(device)
+        text_data = text_data.to(device)

        # we are text conditioned, we produce an embedding from the tokenized text
        if text_conditioned:
@@ -240,7 +242,7 @@ def train(
    # Training loop
    # diffusion prior network

-    prior_network = DiffusionPriorNetwork( 
+    prior_network = DiffusionPriorNetwork(
        dim = image_embed_dim,
        depth = dpn_depth,
        dim_head = dpn_dim_head,
@@ -249,16 +251,16 @@ def train(
        ff_dropout = dropout,
        normformer = dp_normformer
    )
-    
+
    # Load clip model if text-conditioning
    if dp_condition_on_text_encodings:
        clip_adapter = OpenAIClipAdapter(clip)
    else:
        clip_adapter = None
-        
+
    # diffusion prior with text embeddings and image embeddings pre-computed

-    diffusion_prior = DiffusionPrior( 
+    diffusion_prior = DiffusionPrior(
        net = prior_network,
        clip = clip_adapter,
        image_embed_dim = image_embed_dim,
@@ -296,28 +298,46 @@ def train(

    # Utilize wrapper to abstract away loader logic
    print_ribbon("Downloading Embeddings")
-    loader_args = dict(text_conditioned=dp_condition_on_text_encodings, batch_size=batch_size, num_data_points=num_data_points,
-                       train_split=train_percent, eval_split=val_percent, device=device, img_url=image_embed_url)
+    reader_args = dict(text_conditioned=dp_condition_on_text_encodings, img_url=image_embed_url)

    if dp_condition_on_text_encodings:
-        loader_args = dict(**loader_args, meta_url=meta_url)
+        reader_args = dict(**reader_args, meta_url=meta_url)
+        img_reader = get_reader(**reader_args)
+        train_loader, eval_loader, test_loader = make_splits(
+            text_conditioned=dp_condition_on_text_encodings,
+            batch_size=batch_size,
+            num_data_points=num_data_points,
+            train_split=train_percent,
+            eval_split=val_percent,
+            image_reader=img_reader
+            )
    else:
-        loader_args = dict(**loader_args, txt_url=text_embed_url)
-
-    train_loader, eval_loader, test_loader = make_splits(**loader_args)
+        reader_args = dict(**reader_args, txt_url=text_embed_url)
+        img_reader, txt_reader = get_reader(**reader_args)
+        train_loader, eval_loader, test_loader = make_splits(
+            text_conditioned=dp_condition_on_text_encodings,
+            batch_size=batch_size,
+            num_data_points=num_data_points,
+            train_split=train_percent,
+            eval_split=val_percent,
+            image_reader=img_reader,
+            text_reader=txt_reader
+            )

    ### Training code ###

-    step = 1 
+    step = 1
    timer = Timer()
    epochs = num_epochs

    for _ in range(epochs):

        for image, text in tqdm(train_loader):
-            
            diffusion_prior.train()
-            
+
+            image = image.to(device)
+            text = text.to(device)
+
            input_args = dict(image_embed=image)
            if dp_condition_on_text_encodings:
                input_args = dict(**input_args, text = text)
@@ -350,9 +370,9 @@ def train(
            # Use NUM_TEST_EMBEDDINGS samples from the test set each time
            # Get embeddings from the most recently saved model
            if(step % REPORT_METRICS_EVERY) == 0:
-                report_cosine_sims(diffusion_prior, eval_loader, dp_condition_on_text_encodings)
+                report_cosine_sims(diffusion_prior, eval_loader, dp_condition_on_text_encodings, device=device)
                ### Evaluate model(validation run) ###
-                eval_model(diffusion_prior, eval_loader, dp_condition_on_text_encodings, dp_loss_type, phase="Validation")
+                eval_model(diffusion_prior, eval_loader, dp_condition_on_text_encodings, dp_loss_type, phase="Validation", device=device)

            step += 1
            trainer.update()
Author	SHA1	Message	Date
Phil Wang	b7f9607258	make memory efficient unet design from imagen toggle-able	2022-06-15 13:40:26 -07:00
Phil Wang	2219348a6e	adopt similar unet architecture as imagen	2022-06-15 12:18:21 -07:00
Phil Wang	9eea9b9862	add p2 loss reweighting for decoder training as an option	2022-06-14 10:58:57 -07:00
Phil Wang	5d958713c0	fix classifier free guidance for image hiddens summed to time hiddens, thanks to @xvjiarui for finding this bug	2022-06-13 21:01:50 -07:00
Phil Wang	0f31980362	cleanup	2022-06-07 17:31:38 -07:00
Phil Wang	bee5bf3815	fix for https://github.com/lucidrains/DALLE2-pytorch/issues/143	2022-06-07 09:03:48 -07:00
Phil Wang	350a3d6045	0.6.16	2022-06-06 08:45:46 -07:00
Kashif Rasul	1a81670718	fix quadratic_beta_schedule (#141)	2022-06-06 08:45:14 -07:00
Phil Wang	934c9728dc	some cleanup	2022-06-04 16:54:15 -07:00
Phil Wang	ce4b0107c1	0.6.13	2022-06-04 13:26:57 -07:00
zion	64c2f9c4eb	implement ema warmup from @crowsonkb (#140)	2022-06-04 13:26:34 -07:00
Phil Wang	22cc613278	ema fix from @nousr	2022-06-03 19:44:36 -07:00
zion	83517849e5	ema module fixes (#139)	2022-06-03 19:43:51 -07:00
Phil Wang	708809ed6c	lower beta2 for adam down to 0.99, based on https://openreview.net/forum?id=2LdBqxc1Yv	2022-06-03 10:26:28 -07:00
Phil Wang	9cc475f6e7	fix update_every within EMA	2022-06-03 10:21:05 -07:00
Phil Wang	ffd342e9d0	allow for an option to constrain the variance interpolation fraction coming out from the unet for learned variance, if it is turned on	2022-06-03 09:34:57 -07:00
Phil Wang	f8bfd3493a	make destructuring datum length agnostic when validating in training decoder script, for @YUHANG-Ma	2022-06-02 13:54:57 -07:00
Phil Wang	9025345e29	take a stab at fixing generate_grid_samples when real images have a greater image size than generated	2022-06-02 11:33:15 -07:00
Phil Wang	8cc278447e	just cast to right types for blur sigma and kernel size augs	2022-06-02 11:21:58 -07:00
Phil Wang	38cd62010c	allow for random blur sigma and kernel size augmentations on low res conditioning (need to reread paper to see if the augmentation value needs to be fed into the unet for conditioning as well)	2022-06-02 11:11:25 -07:00
Ryan Russell	1cc288af39	Improve Readability (#133) Signed-off-by: Ryan Russell <git@ryanrussell.org>	2022-06-01 13:28:02 -07:00
Phil Wang	a851168633	make youtokentome optional package, due to reported installation difficulties	2022-06-01 09:25:35 -07:00
Phil Wang	1ffeecd0ca	lower default ema beta value	2022-05-31 11:55:21 -07:00
Phil Wang	3df899f7a4	patch	2022-05-31 09:03:43 -07:00
Aidan Dempster	09534119a1	Fixed non deterministic optimizer creation (#130)	2022-05-31 09:03:20 -07:00
Phil Wang	6f8b90d4d7	add packaging package	2022-05-30 11:45:00 -07:00
Phil Wang	b588286288	fix version	2022-05-30 11:06:34 -07:00
Phil Wang	b693e0be03	default number of resnet blocks per layer in unet to 2 (in imagen it was 3 for base 64x64)	2022-05-30 10:06:48 -07:00
Phil Wang	a0bed30a84	additional conditioning on image embedding by summing to time embeddings (for FiLM like conditioning in subsequent layers), from passage found in paper by @mhh0318	2022-05-30 09:26:51 -07:00
zion	387c5bf774	quick patch for new prior loader (#123)	2022-05-29 16:25:53 -07:00