allow for overriding use of EMA during sampling in decoder trainer with use_non_ema keyword, also fix some issues with automatic normalization of images and low res conditioning image if latent diffusion is in play

allow text encodings and text mask to be passed in on forward and sampling for Decoder class
back to no_grad for now, also keep track and restore unet devices in one_unet_in_gpu contextmanager
2026-02-12 19:44:26 +01:00 · 2022-05-16 11:18:30 -07:00 · 2022-05-16 10:40:32 -07:00 · 2022-05-16 09:36:14 -07:00 · 2022-05-16 09:17:17 -07:00 · 2022-05-15 20:16:38 -07:00
10 changed files with 663 additions and 341 deletions
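
The device bookkeeping mentioned for `one_unet_in_gpu` can be sketched in isolation. This is a minimal illustration, not the code in the diff: `FakeUnet` and `one_unet_on_gpu` are hypothetical names, devices are plain strings so that the sketch runs without a GPU, and the `try/finally` is an addition of this sketch that the diff itself does not include.

```python
from contextlib import contextmanager

class FakeUnet:
    """Hypothetical stand-in for an nn.Module; it only tracks a device string."""
    def __init__(self, device):
        self.device = device

    def to(self, device):
        self.device = device
        return self

@contextmanager
def one_unet_on_gpu(unets, index):
    # remember where every unet lived before entering the context
    devices = [unet.device for unet in unets]

    # move everything to the cpu, then only the selected unet to the gpu
    for unet in unets:
        unet.to('cpu')
    unets[index].to('cuda')

    try:
        yield unets[index]
    finally:
        # restore the original placement of every unet, even if the body raised
        for unet, device in zip(unets, devices):
            unet.to(device)

unets = [FakeUnet('cuda'), FakeUnet('cpu')]

with one_unet_on_gpu(unets, 1):
    inside = [unet.device for unet in unets]  # placement while the context is active

after = [unet.device for unet in unets]       # placement once the context has exited
```

Recording the devices up front is what lets the context manager put each unet back where it started, instead of leaving the selected unet on the cpu as the previous version did.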
--- a/README.md
+++ b/README.md
@@ -706,7 +706,7 @@ mock_image_embed = torch.randn(1, 512).cuda()
 images = decoder.sample(mock_image_embed) # (1, 3, 1024, 1024)
 ```

-## Training wrapper (wip)
+## Training wrapper

 ### Decoder Training

@@ -851,6 +851,57 @@ diffusion_prior_trainer.update()  # this will update the optimizer as well as th
 image_embeds = diffusion_prior_trainer.sample(text) # (4, 512) - exponential moving averaged image embeddings
 ```

+## Bonus
+
+### Unconditional Training
+
+The repository also contains the means to train an unconditional DDPM model, or even cascading DDPMs. You simply have to set `unconditional = True` in the `Decoder`.
+
+ex.
+
+```python
+import torch
+from dalle2_pytorch import Unet, Decoder
+
+# unet for the cascading ddpm
+
+unet1 = Unet(
+    dim = 128,
+    dim_mults=(1, 2, 4, 8)
+).cuda()
+
+unet2 = Unet(
+    dim = 32,
+    dim_mults = (1, 2, 4, 8, 16)
+).cuda()
+
+# decoder, which contains the unets
+
+decoder = Decoder(
+    unet = (unet1, unet2),
+    image_sizes = (256, 512),  # first unet up to 256px, then second to 512px
+    timesteps = 1000,
+    unconditional = True
+).cuda()
+
+# mock images (get a lot of this)
+
+images = torch.randn(1, 3, 512, 512).cuda()
+
+# feed images into decoder
+
+for i in (1, 2):
+    loss = decoder(images, unet_number = i)
+    loss.backward()
+
+# do the above for many many many many steps
+# then it will learn to generate images
+
+images = decoder.sample(batch_size = 2) # (2, 3, 512, 512)
+```
+
+## Dataloaders
+
 ### Decoder Dataloaders

 In order to make loading data simple and efficient, we include some general dataloaders that can be used to train portions of the network.
@@ -895,14 +946,14 @@ dataset = ImageEmbeddingDataset(
 )
 ```

-## Scripts
+### Scripts (wip)

-### Using the `train_diffusion_prior.py` script
+#### `train_diffusion_prior.py`

 This script allows training the DiffusionPrior on pre-computed text and image embeddings. The working example below elucidates this process.
 Please note that the script internally passes text_embed and image_embed to the DiffusionPrior, unlike the example below.

-### Usage 
+#### Usage

 ```bash
 $ python train_diffusion_prior.py
@@ -910,58 +961,49 @@ $ python train_diffusion_prior.py

 The most significant parameters for the script are as follows:

--image-embed-url, default = "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/")
+- `image-embed-url`, default = `"https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/"`

--text-embed-url, default = "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/")
+- `text-embed-url`, default = `"https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/"`

--image-embed-dim, default=768 - 768 corresponds to the ViT iL/14 embedding size,change it to what your chosen ViT generates
+- `image-embed-dim`, default = `768` - 768 corresponds to the ViT L/14 embedding size; change it to match what your chosen ViT generates

--learning-rate, default=1.1e-4
+- `learning-rate`, default = `1.1e-4`

--weight-decay,  default=6.02e-2
+- `weight-decay`, default = `6.02e-2`

--max-grad-norm, default=0.5
+- `max-grad-norm`, default = `0.5`

--batch-size, default=10 ** 4
+- `batch-size`, default = `10 ** 4`

--num-epochs, default=5
+- `num-epochs`, default = `5`

--clip, default=None # Signals the prior to use pre-computed embeddings
+- `clip`, default = `None` # Signals the prior to use pre-computed embeddings

-### Sample wandb run log
-
-Please find a sample wandb run log at : https://wandb.ai/laion/diffusion-prior/runs/1blxu24j
-
-### Loading and saving the Diffusion Prior model
+#### Loading and Saving the DiffusionPrior model

 Two methods are provided, load_diffusion_model and save_diffusion_model, the names being self-explanatory. 

-## from dalle2_pytorch.train import load_diffusion_model, save_diffusion_model
+```python
+from dalle2_pytorch.train import load_diffusion_model, save_diffusion_model
+```
+
+##### Loading

    load_diffusion_model(dprior_path, device) 
-
        dprior_path : path to saved model(.pth)
-    
        device      : the cuda device you're running on
    
+##### Saving
+
    save_diffusion_model(save_path, model, optimizer, scaler, config, image_embed_dim)
-    
        save_path : path to save at
-    
        model     : object of Diffusion_Prior
-    
        optimizer : optimizer object - see train_diffusion_prior.py for how to create one. 
-    
            e.g: optimizer = get_optimizer(diffusion_prior.net.parameters(), wd=weight_decay, lr=learning_rate)
-    
        scaler    : a GradScaler object.
-    
            e.g: scaler = GradScaler(enabled=amp)
-    
        config    : config object created in train_diffusion_prior.py - see file for example. 
-    
        image_embed_dim - the dimension of the image_embedding
-    
            e.g: 768

 ## CLI (wip)
@@ -1021,6 +1063,9 @@ Once built, images will be saved to the same directory the command is invoked
 - [ ] bring in skip-layer excitatons (from lightweight gan paper) to see if it helps for either decoder of unet or vqgan-vae training
 - [ ] decoder needs one day worth of refactor for tech debt
 - [ ] allow for unet to be able to condition non-cross attention style as well
+- [ ] for all model classes with hyperparameters that change the network architecture, make it a requirement that they expose a config property, and write a simple function that asserts that it restores the object correctly
+- [ ] for both diffusion prior and decoder, all exponential moving averaged models need to be saved and restored as well (along with the step number)
+- [ ] read the paper, figure it out, and build it https://github.com/lucidrains/DALLE2-pytorch/issues/89

 ## Citations

@@ -1109,4 +1154,13 @@ Once built, images will be saved to the same directory the command is invoked
 }
 ```

+```bibtex
+@article{ho2021cascaded,
+    title   = {Cascaded Diffusion Models for High Fidelity Image Generation},
+    author  = {Ho, Jonathan and Saharia, Chitwan and Chan, William and Fleet, David J and Norouzi, Mohammad and Salimans, Tim},
+    journal = {arXiv preprint arXiv:2106.15282},
+    year    = {2021}
+}
+```
+
 *Creating noise from data is easy; creating data from noise is generative modeling.* - <a href="https://arxiv.org/abs/2011.13456">Yang Song's paper</a>
--- a/dalle2_pytorch/__init__.py
+++ b/dalle2_pytorch/__init__.py
@@ -1,6 +1,6 @@
 from dalle2_pytorch.dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder
 from dalle2_pytorch.dalle2_pytorch import OpenAIClipAdapter
-from dalle2_pytorch.train import DecoderTrainer, DiffusionPriorTrainer
+from dalle2_pytorch.trainer import DecoderTrainer, DiffusionPriorTrainer

 from dalle2_pytorch.vqgan_vae import VQGanVAE
 from x_clip import CLIP
--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
@@ -61,6 +61,9 @@ def default(val, d):
 def cast_tuple(val, length = 1):
    return val if isinstance(val, tuple) else ((val,) * length)

+def module_device(module):
+    return next(module.parameters()).device
+
 @contextmanager
 def null_context(*args, **kwargs):
    yield
@@ -794,7 +797,7 @@ class DiffusionPriorNetwork(nn.Module):
        text_embed,
        text_encodings = None,
        mask = None,
-        cond_drop_prob = 0.2
+        cond_drop_prob = 0.
    ):
        batch, dim, device, dtype = *image_embed.shape, image_embed.device, image_embed.dtype

@@ -901,6 +904,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
        self.channels = default(image_channels, lambda: clip.image_channels)

        self.cond_drop_prob = cond_drop_prob
+        self.can_classifier_guidance = cond_drop_prob > 0.
        self.condition_on_text_encodings = condition_on_text_encodings

        # in paper, they do not predict the noise, but predict x0 directly for image embedding, claiming empirically better results. I'll just offer both.
@@ -914,8 +918,10 @@ class DiffusionPrior(BaseGaussianDiffusion):
        self.training_clamp_l2norm = training_clamp_l2norm
        self.init_image_embed_l2norm = init_image_embed_l2norm

-    def p_mean_variance(self, x, t, text_cond, clip_denoised: bool):
-        pred = self.net(x, t, **text_cond)
+    def p_mean_variance(self, x, t, text_cond, clip_denoised = False, cond_scale = 1.):
+        assert not (cond_scale != 1. and not self.can_classifier_guidance), 'the model was not trained with conditional dropout, and thus one cannot use classifier free guidance (cond_scale anything other than 1)'
+
+        pred = self.net.forward_with_cond_scale(x, t, cond_scale = cond_scale, **text_cond)

        if self.predict_x_start:
            x_recon = pred
@@ -933,17 +939,17 @@ class DiffusionPrior(BaseGaussianDiffusion):
        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
        return model_mean, posterior_variance, posterior_log_variance

-    @torch.inference_mode()
-    def p_sample(self, x, t, text_cond = None, clip_denoised = True, repeat_noise = False):
+    @torch.no_grad()
+    def p_sample(self, x, t, text_cond = None, clip_denoised = True, repeat_noise = False, cond_scale = 1.):
        b, *_, device = *x.shape, x.device
-        model_mean, _, model_log_variance = self.p_mean_variance(x = x, t = t, text_cond = text_cond, clip_denoised = clip_denoised)
+        model_mean, _, model_log_variance = self.p_mean_variance(x = x, t = t, text_cond = text_cond, clip_denoised = clip_denoised, cond_scale = cond_scale)
        noise = noise_like(x.shape, device, repeat_noise)
        # no noise when t == 0
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise

-    @torch.inference_mode()
-    def p_sample_loop(self, shape, text_cond):
+    @torch.no_grad()
+    def p_sample_loop(self, shape, text_cond, cond_scale = 1.):
        device = self.betas.device

        b = shape[0]
@@ -954,7 +960,7 @@ class DiffusionPrior(BaseGaussianDiffusion):

        for i in tqdm(reversed(range(0, self.num_timesteps)), desc='sampling loop time step', total=self.num_timesteps):
            times = torch.full((b,), i, device = device, dtype = torch.long)
-            image_embed = self.p_sample(image_embed, times, text_cond = text_cond)
+            image_embed = self.p_sample(image_embed, times, text_cond = text_cond, cond_scale = cond_scale)

        return image_embed

@@ -978,21 +984,21 @@ class DiffusionPrior(BaseGaussianDiffusion):
        loss = self.loss_fn(pred, target)
        return loss

-    @torch.inference_mode()
+    @torch.no_grad()
    @eval_decorator
-    def sample_batch_size(self, batch_size, text_cond):
+    def sample_batch_size(self, batch_size, text_cond, cond_scale = 1.):
        device = self.betas.device
        shape = (batch_size, self.image_embed_dim)

        img = torch.randn(shape, device = device)

        for i in tqdm(reversed(range(0, self.num_timesteps)), desc = 'sampling loop time step', total = self.num_timesteps):
-            img = self.p_sample(img, torch.full((batch_size,), i, device = device, dtype = torch.long), text_cond = text_cond)
+            img = self.p_sample(img, torch.full((batch_size,), i, device = device, dtype = torch.long), text_cond = text_cond, cond_scale = cond_scale)
        return img

-    @torch.inference_mode()
+    @torch.no_grad()
    @eval_decorator
-    def sample(self, text, num_samples_per_batch = 2):
+    def sample(self, text, num_samples_per_batch = 2, cond_scale = 1.):
        # in the paper, what they did was
        # sample 2 image embeddings, choose the top 1 similarity, as judged by CLIP
        text = repeat(text, 'b ... -> (b r) ...', r = num_samples_per_batch)
@@ -1007,7 +1013,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
        if self.condition_on_text_encodings:
            text_cond = {**text_cond, 'text_encodings': text_encodings, 'mask': text_mask}

-        image_embeds = self.p_sample_loop((batch_size, image_embed_dim), text_cond = text_cond)
+        image_embeds = self.p_sample_loop((batch_size, image_embed_dim), text_cond = text_cond, cond_scale = cond_scale)

        # retrieve original unscaled image embed

@@ -1305,7 +1311,7 @@ class Unet(nn.Module):
        self,
        dim,
        *,
-        image_embed_dim,
+        image_embed_dim = None,
        text_embed_dim = None,
        cond_dim = None,
        num_image_tokens = 4,
@@ -1377,7 +1383,7 @@ class Unet(nn.Module):
        self.image_to_cond = nn.Sequential(
            nn.Linear(image_embed_dim, cond_dim * num_image_tokens),
            Rearrange('b (n d) -> b n d', n = num_image_tokens)
-        ) if image_embed_dim != cond_dim else nn.Identity()
+        ) if cond_on_image_embeds and image_embed_dim != cond_dim else nn.Identity()

        self.norm_cond = nn.LayerNorm(cond_dim)
        self.norm_mid_cond = nn.LayerNorm(cond_dim)
@@ -1387,7 +1393,8 @@ class Unet(nn.Module):
        self.text_to_cond = None

        if cond_on_text_encodings:
-            self.text_to_cond = nn.LazyLinear(cond_dim) if not exists(text_embed_dim) else nn.Linear(text_embed_dim, cond_dim)
+            assert exists(text_embed_dim), 'text_embed_dim must be given to the unet if cond_on_text_encodings is True'
+            self.text_to_cond = nn.Linear(text_embed_dim, cond_dim)

        # finer control over whether to condition on image embeddings and text encodings
        # so one can have the latter unets in the cascading DDPMs only focus on super-resoluting
@@ -1701,7 +1708,7 @@ class Decoder(BaseGaussianDiffusion):
        self.unconditional = unconditional
        assert not (condition_on_text_encodings and unconditional), 'unconditional decoder image generation cannot be set to True if conditioning on text is present'

-        assert exists(clip) ^ exists(image_size), 'either CLIP is supplied, or you must give the image_size and channels (usually 3 for RGB)'
+        assert self.unconditional or (exists(clip) ^ exists(image_size)), 'either CLIP is supplied, or you must give the image_size and channels (usually 3 for RGB)'

        self.clip = None
        if exists(clip):
@@ -1792,6 +1799,7 @@ class Decoder(BaseGaussianDiffusion):

        self.image_cond_drop_prob = image_cond_drop_prob
        self.text_cond_drop_prob = text_cond_drop_prob
+        self.can_classifier_guidance = image_cond_drop_prob > 0. or text_cond_drop_prob > 0.

        # whether to clip when sampling

@@ -1811,13 +1819,19 @@ class Decoder(BaseGaussianDiffusion):
            unet = self.get_unet(unet_number)

        self.cuda()
-        self.unets.cpu()

+        devices = [module_device(unet) for unet in self.unets]
+        self.unets.cpu()
        unet.cuda()
+
        yield
-        unet.cpu()
+
+        for unet, device in zip(self.unets, devices):
+            unet.to(device)

    def p_mean_variance(self, unet, x, t, image_embed, text_encodings = None, text_mask = None, lowres_cond_img = None, clip_denoised = True, predict_x_start = False, learned_variance = False, cond_scale = 1., model_output = None):
+        assert not (cond_scale != 1. and not self.can_classifier_guidance), 'the decoder was not trained with conditional dropout, and thus one cannot use classifier free guidance (cond_scale anything other than 1)'
+
        pred = default(model_output, lambda: unet.forward_with_cond_scale(x, t, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, cond_scale = cond_scale, lowres_cond_img = lowres_cond_img))

        if learned_variance:
@@ -1846,7 +1860,7 @@ class Decoder(BaseGaussianDiffusion):

        return model_mean, posterior_variance, posterior_log_variance

-    @torch.inference_mode()
+    @torch.no_grad()
    def p_sample(self, unet, x, t, image_embed, text_encodings = None, text_mask = None, cond_scale = 1., lowres_cond_img = None, predict_x_start = False, learned_variance = False, clip_denoised = True, repeat_noise = False):
        b, *_, device = *x.shape, x.device
        model_mean, _, model_log_variance = self.p_mean_variance(unet, x = x, t = t, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, cond_scale = cond_scale, lowres_cond_img = lowres_cond_img, clip_denoised = clip_denoised, predict_x_start = predict_x_start, learned_variance = learned_variance)
@@ -1855,14 +1869,15 @@ class Decoder(BaseGaussianDiffusion):
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise

-    @torch.inference_mode()
-    def p_sample_loop(self, unet, shape, image_embed, predict_x_start = False, learned_variance = False, clip_denoised = True, lowres_cond_img = None, text_encodings = None, text_mask = None, cond_scale = 1):
+    @torch.no_grad()
+    def p_sample_loop(self, unet, shape, image_embed, predict_x_start = False, learned_variance = False, clip_denoised = True, lowres_cond_img = None, text_encodings = None, text_mask = None, cond_scale = 1, is_latent_diffusion = False):
        device = self.betas.device

        b = shape[0]
        img = torch.randn(shape, device = device)

-        lowres_cond_img = maybe(normalize_neg_one_to_one)(lowres_cond_img)
+        if not is_latent_diffusion:
+            lowres_cond_img = maybe(normalize_neg_one_to_one)(lowres_cond_img)

        for i in tqdm(reversed(range(0, self.num_timesteps)), desc = 'sampling loop time step', total = self.num_timesteps):
            img = self.p_sample(
@@ -1882,13 +1897,14 @@ class Decoder(BaseGaussianDiffusion):
        unnormalize_img = unnormalize_zero_to_one(img)
        return unnormalize_img

-    def p_losses(self, unet, x_start, times, *, image_embed, lowres_cond_img = None, text_encodings = None, text_mask = None, predict_x_start = False, noise = None, learned_variance = False, clip_denoised = False):
+    def p_losses(self, unet, x_start, times, *, image_embed, lowres_cond_img = None, text_encodings = None, text_mask = None, predict_x_start = False, noise = None, learned_variance = False, clip_denoised = False, is_latent_diffusion = False):
        noise = default(noise, lambda: torch.randn_like(x_start))

        # normalize to [-1, 1]

-        x_start = normalize_neg_one_to_one(x_start)
-        lowres_cond_img = maybe(normalize_neg_one_to_one)(lowres_cond_img)
+        if not is_latent_diffusion:
+            x_start = normalize_neg_one_to_one(x_start)
+            lowres_cond_img = maybe(normalize_neg_one_to_one)(lowres_cond_img)

        # get x_t

@@ -1948,12 +1964,14 @@ class Decoder(BaseGaussianDiffusion):

        return loss + vb_loss

-    @torch.inference_mode()
+    @torch.no_grad()
    @eval_decorator
    def sample(
        self,
        image_embed = None,
        text = None,
+        text_mask = None,
+        text_encodings = None,
        batch_size = 1,
        cond_scale = 1.,
        stop_at_unet_number = None
@@ -1963,8 +1981,8 @@ class Decoder(BaseGaussianDiffusion):
        if not self.unconditional:
            batch_size = image_embed.shape[0]

-        text_encodings = text_mask = None
-        if exists(text):
+        if exists(text) and not exists(text_encodings) and not self.unconditional:
+            assert exists(self.clip)
            _, text_encodings, text_mask = self.clip.embed_text(text)

        assert not (self.condition_on_text_encodings and not exists(text_encodings)), 'text or text encodings must be passed into decoder if specified'
@@ -1988,8 +2006,7 @@ class Decoder(BaseGaussianDiffusion):
                image_size = vae.get_encoded_fmap_size(image_size)
                shape = (batch_size, vae.encoded_dim, image_size, image_size)

-                if exists(lowres_cond_img):
-                    lowres_cond_img = vae.encode(lowres_cond_img)
+                lowres_cond_img = maybe(vae.encode)(lowres_cond_img)

                img = self.p_sample_loop(
                    unet,
@@ -2001,7 +2018,8 @@ class Decoder(BaseGaussianDiffusion):
                    predict_x_start = predict_x_start,
                    learned_variance = learned_variance,
                    clip_denoised = not is_latent_diffusion,
-                    lowres_cond_img = lowres_cond_img
+                    lowres_cond_img = lowres_cond_img,
+                    is_latent_diffusion = is_latent_diffusion
                )

                img = vae.decode(img)
@@ -2017,6 +2035,7 @@ class Decoder(BaseGaussianDiffusion):
        text = None,
        image_embed = None,
        text_encodings = None,
+        text_mask = None,
        unet_number = None
    ):
        assert not (len(self.unets) > 1 and not exists(unet_number)), f'you must specify which unet you want trained, from a range of 1 to {len(self.unets)}, if you are training cascading DDPM (multiple unets)'
@@ -2037,12 +2056,11 @@ class Decoder(BaseGaussianDiffusion):

        times = torch.randint(0, self.num_timesteps, (b,), device = device, dtype = torch.long)

-        if not exists(image_embed):
+        if not exists(image_embed) and not self.unconditional:
            assert exists(self.clip), 'if you want to derive CLIP image embeddings automatically, you must supply `clip` to the decoder on init'
            image_embed, _ = self.clip.embed_image(image)

-        text_encodings = text_mask = None
-        if exists(text) and not exists(text_encodings):
+        if exists(text) and not exists(text_encodings) and not self.unconditional:
            assert exists(self.clip), 'if you are passing in raw text, you need to supply `clip` to the decoder'
            _, text_encodings, text_mask = self.clip.embed_text(text)

@@ -2060,14 +2078,14 @@ class Decoder(BaseGaussianDiffusion):
            image = aug(image)
            lowres_cond_img = aug(lowres_cond_img, params = aug._params)

+        is_latent_diffusion = not isinstance(vae, NullVQGanVAE)
+
        vae.eval()
        with torch.no_grad():
            image = vae.encode(image)
+            lowres_cond_img = maybe(vae.encode)(lowres_cond_img)

-            if exists(lowres_cond_img):
-                lowres_cond_img = vae.encode(lowres_cond_img)
-
-        return self.p_losses(unet, image, times, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, lowres_cond_img = lowres_cond_img, predict_x_start = predict_x_start, learned_variance = learned_variance)
+        return self.p_losses(unet, image, times, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, lowres_cond_img = lowres_cond_img, predict_x_start = predict_x_start, learned_variance = learned_variance, is_latent_diffusion = is_latent_diffusion)

 # main class

@@ -2090,22 +2108,23 @@ class DALLE2(nn.Module):

        self.to_pil = T.ToPILImage()

-    @torch.inference_mode()
+    @torch.no_grad()
    @eval_decorator
    def forward(
        self,
        text,
        cond_scale = 1.,
+        prior_cond_scale = 1.,
        return_pil_images = False
    ):
-        device = next(self.parameters()).device
+        device = module_device(self)
        one_text = isinstance(text, str) or (not is_list_str(text) and text.shape[0] == 1)

        if isinstance(text, str) or is_list_str(text):
            text = [text] if not isinstance(text, (list, tuple)) else text
            text = tokenizer.tokenize(text).to(device)

-        image_embed = self.prior.sample(text, num_samples_per_batch = self.prior_num_samples)
+        image_embed = self.prior.sample(text, num_samples_per_batch = self.prior_num_samples, cond_scale = prior_cond_scale)

        text_cond = text if self.decoder_need_text_cond else None
        images = self.decoder.sample(image_embed, text = text_cond, cond_scale = cond_scale)
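
The new `can_classifier_guidance` guard and the `cond_scale` plumbing can be illustrated on plain numbers. This is a sketch under stated assumptions: `forward_with_cond_scale` is not shown in this diff, so the combination rule below is the usual classifier-free guidance formula rather than a confirmed copy of the implementation, and `guided_prediction` is a hypothetical helper name.

```python
def guided_prediction(cond_pred, null_pred, cond_scale = 1., can_classifier_guidance = True):
    # same guard as in the diff: any scale other than 1 requires a model trained with conditional dropout
    assert not (cond_scale != 1. and not can_classifier_guidance), \
        'the model was not trained with conditional dropout'

    # a scale of exactly 1 means no guidance, so the conditional prediction is used as is
    if cond_scale == 1.:
        return cond_pred

    # assumed combination rule (standard classifier-free guidance):
    # extrapolate from the unconditional prediction towards the conditional one
    return null_pred + (cond_pred - null_pred) * cond_scale

plain = guided_prediction(0.5, 0.25)                    # no guidance
guided = guided_prediction(0.5, 0.25, cond_scale = 2.)  # guidance scale of 2
```

With `cond_drop_prob` now defaulting to `0.` in `DiffusionPriorNetwork.forward`, a prior that was never trained with dropout never saw the unconditional case, which is why the guard rejects any `cond_scale` other than 1 for such a model.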
--- a/dalle2_pytorch/dataloaders/__init__.py
+++ b/dalle2_pytorch/dataloaders/__init__.py
@@ -1 +1,2 @@
-from dalle2_pytorch.dataloaders.decoder_loader import ImageEmbeddingDataset, create_image_embedding_dataloader
+from dalle2_pytorch.dataloaders.decoder_loader import ImageEmbeddingDataset, create_image_embedding_dataloader
+from dalle2_pytorch.dataloaders.embedding_wrapper import make_splits
--- a/dalle2_pytorch/dataloaders/embedding_wrapper.py
+++ b/dalle2_pytorch/dataloaders/embedding_wrapper.py
@@ -0,0 +1,180 @@
+from torch.utils.data import IterableDataset
+from torch import from_numpy
+from clip import tokenize
+from embedding_reader import EmbeddingReader
+
+
+class PriorEmbeddingLoader(IterableDataset):
+    def __init__(
+        self,
+        text_conditioned: bool,
+        batch_size: int,
+        start: int,
+        stop: int,
+        image_reader,
+        text_reader: EmbeddingReader = None,
+        device: str = "cpu",
+    ) -> None:
+        super().__init__()
+
+        self.text_conditioned = text_conditioned
+
+        if not self.text_conditioned:
+            self.text_reader = text_reader
+
+        self.image_reader = image_reader
+        self.batch_size = batch_size
+        self.start = start
+        self.stop = stop
+        self.device = device
+
+    def __iter__(self):
+        self.n = 0
+        loader_args = dict(
+            batch_size=self.batch_size,
+            start=self.start,
+            end=self.stop,
+            show_progress=False,
+        )
+        if self.text_conditioned:
+            self.loader = self.image_reader(**loader_args)
+        else:
+            self.loader = zip(
+                self.image_reader(**loader_args), self.text_reader(**loader_args)
+            )
+        return self
+
+    def __next__(self):
+        try:
+            return self.get_sample()
+        except StopIteration:
+            raise StopIteration
+
+    def get_sample(self):
+        """
+        pre-process data from either reader into a common format
+        """
+        self.n += 1
+
+        if self.text_conditioned:
+            image_embedding, caption = next(self.loader)
+
+            image_embedding = from_numpy(image_embedding).to(self.device)
+            tokenized_caption = tokenize(
+                caption["caption"].to_list(), truncate=True
+            ).to(self.device)
+
+            return image_embedding, tokenized_caption
+
+        else:
+            (image_embedding, _), (text_embedding, _) = next(self.loader)
+
+            image_embedding = from_numpy(image_embedding).to(self.device)
+            text_embedding = from_numpy(text_embedding).to(self.device)
+
+            return image_embedding, text_embedding
+
+
+def make_splits(
+    text_conditioned: bool,
+    batch_size: int,
+    num_data_points: int,
+    train_split: float,
+    eval_split: float,
+    device: str,
+    img_url: str,
+    meta_url: str = None,
+    txt_url: str = None,
+):
+
+    assert img_url is not None, "Must supply some image embeddings"
+
+    if text_conditioned:
+        assert meta_url is not None, "Must supply metadata url if text-conditioning"
+        image_reader = EmbeddingReader(
+            embeddings_folder=img_url,
+            file_format="parquet_npy",
+            meta_columns=["caption"],
+            metadata_folder=meta_url,
+        )
+
+        # compute split points
+        if num_data_points > image_reader.count:
+            print("Specified point count is larger than the number of points available...defaulting to max length of reader.")
+            num_data_points = image_reader.count
+
+        train_set_size = int(train_split * num_data_points)
+        eval_set_size = int(eval_split * num_data_points)
+        eval_stop = int(train_set_size + eval_set_size)
+
+        train_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            batch_size=batch_size,
+            start=0,
+            stop=train_set_size,
+            device=device,
+        )
+        eval_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            batch_size=batch_size,
+            start=train_set_size,
+            stop=eval_stop,
+            device=device,
+        )
+        test_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            batch_size=batch_size,
+            start=eval_stop,
+            stop=int(num_data_points),
+            device=device,
+        )
+
+    else:
+        assert (
+            txt_url is not None
+        ), "Must supply text embedding url if not text-conditioning"
+
+        image_reader = EmbeddingReader(img_url, file_format="npy")
+        text_reader = EmbeddingReader(txt_url, file_format="npy")
+
+        # compute split points
+        if num_data_points > image_reader.count:
+            print("Specified point count is larger than the number of points available...defaulting to max length of reader.")
+            num_data_points = image_reader.count
+
+        train_set_size = int(train_split * num_data_points)
+        eval_set_size = int(eval_split * num_data_points)
+        eval_stop = int(train_set_size + eval_set_size)
+
+        train_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            text_reader=text_reader,
+            batch_size=batch_size,
+            start=0,
+            stop=train_set_size,
+            device=device,
+        )
+        eval_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            text_reader=text_reader,
+            batch_size=batch_size,
+            start=train_set_size,
+            stop=eval_stop,
+            device=device,
+        )
+        test_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            text_reader=text_reader,
+            batch_size=batch_size,
+            start=eval_stop,
+            stop=int(num_data_points),
+            device=device,
+        )
+
+    return train_loader, eval_loader, test_loader
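
The split-point arithmetic in `make_splits` can be checked on its own. The sketch below reproduces only the index computation from the diff; `split_bounds` is a hypothetical helper name, and the point count and fractions are illustrative values, not defaults from the script.

```python
def split_bounds(num_data_points, train_split, eval_split):
    # same arithmetic as make_splits: both sizes are truncated towards zero
    train_set_size = int(train_split * num_data_points)
    eval_set_size = int(eval_split * num_data_points)
    eval_stop = train_set_size + eval_set_size

    # (start, stop) pairs handed to the train, eval and test loaders
    return (
        (0, train_set_size),
        (train_set_size, eval_stop),
        (eval_stop, int(num_data_points)),
    )

train_range, eval_range, test_range = split_bounds(1000, 0.5, 0.25)
```

Because both sizes are truncated, any remainder ends up in the test range, which always runs from `eval_stop` to `num_data_points`. No separate test fraction is passed in.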
--- a/dalle2_pytorch/dataloaders/simple_image_only_dataloader.py
+++ b/dalle2_pytorch/dataloaders/simple_image_only_dataloader.py
@@ -0,0 +1,59 @@
+from pathlib import Path
+
+import torch
+from torch.utils import data
+from torchvision import transforms, utils
+
+from PIL import Image
+
+# helpers functions
+
+def cycle(dl):
+    while True:
+        for data in dl:
+            yield data
+
+# dataset and dataloader
+
+class Dataset(data.Dataset):
+    def __init__(
+        self,
+        folder,
+        image_size,
+        exts = ['jpg', 'jpeg', 'png']
+    ):
+        super().__init__()
+        self.folder = folder
+        self.image_size = image_size
+        self.paths = [p for ext in exts for p in Path(f'{folder}').glob(f'**/*.{ext}')]
+
+        self.transform = transforms.Compose([
+            transforms.Resize(image_size),
+            transforms.RandomHorizontalFlip(),
+            transforms.CenterCrop(image_size),
+            transforms.ToTensor()
+        ])
+
+    def __len__(self):
+        return len(self.paths)
+
+    def __getitem__(self, index):
+        path = self.paths[index]
+        img = Image.open(path)
+        return self.transform(img)
+
+def get_images_dataloader(
+    folder,
+    *,
+    batch_size,
+    image_size,
+    shuffle = True,
+    cycle_dl = True,
+    pin_memory = True
+):
+    ds = Dataset(folder, image_size)
+    dl = data.DataLoader(ds, batch_size = batch_size, shuffle = shuffle, pin_memory = pin_memory)
+
+    if cycle_dl:
+        dl = cycle(dl)
+    return dl
--- a/dalle2_pytorch/trainer.py
+++ b/dalle2_pytorch/trainer.py
@@ -1,7 +1,7 @@
 import time
 import copy
 from math import ceil
-from functools import partial
+from functools import partial, wraps
 from collections.abc import Iterable

 import torch
@@ -11,6 +11,8 @@ from torch.cuda.amp import autocast, GradScaler
 from dalle2_pytorch.dalle2_pytorch import Decoder, DiffusionPrior
 from dalle2_pytorch.optimizer import get_optimizer

+import numpy as np
+
 # helper functions

 def exists(val):
@@ -45,6 +47,29 @@ def groupby_prefix_and_trim(prefix, d):
    kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
    return kwargs_without_prefix, kwargs

+# decorators
+
+def cast_torch_tensor(fn):
+    @wraps(fn)
+    def inner(model, *args, **kwargs):
+        device = kwargs.pop('_device', next(model.parameters()).device)
+        cast_device = kwargs.pop('_cast_device', True)
+
+        kwargs_keys = kwargs.keys()
+        all_args = (*args, *kwargs.values())
+        split_kwargs_index = len(all_args) - len(kwargs_keys)
+        all_args = tuple(map(lambda t: torch.from_numpy(t) if exists(t) and isinstance(t, np.ndarray) else t, all_args))
+
+        if cast_device:
+            all_args = tuple(map(lambda t: t.to(device) if exists(t) and isinstance(t, torch.Tensor) else t, all_args))
+
+        args, kwargs_values = all_args[:split_kwargs_index], all_args[split_kwargs_index:]
+        kwargs = dict(tuple(zip(kwargs_keys, kwargs_values)))
+
+        out = fn(model, *args, **kwargs)
+        return out
+    return inner
+
 # gradient accumulation functions

 def split_iterable(it, split_size):
@@ -80,13 +105,13 @@ def split_args_and_kwargs(*args, split_size = None, **kwargs):

    batch_size = len(first_tensor)
    split_size = default(split_size, batch_size)
-    chunk_size = ceil(batch_size / split_size)
+    num_chunks = ceil(batch_size / split_size)

    dict_len = len(kwargs)
    dict_keys = kwargs.keys()
    split_kwargs_index = len_all_args - dict_len

-    split_all_args = [split(arg, split_size = split_size) if exists(arg) and isinstance(arg, (torch.Tensor, Iterable)) else ((arg,) * chunk_size) for arg in all_args]
+    split_all_args = [split(arg, split_size = split_size) if exists(arg) and isinstance(arg, (torch.Tensor, Iterable)) else ((arg,) * num_chunks) for arg in all_args]
    chunk_sizes = tuple(map(len, split_all_args[0]))

    for (chunk_size, *chunked_all_args) in tuple(zip(chunk_sizes, *split_all_args)):
@@ -154,8 +179,8 @@ class EMA(nn.Module):
        self.online_model = model
        self.ema_model = copy.deepcopy(model)

-        self.update_after_step = update_after_step # only start EMA after this step number, starting at 0
        self.update_every = update_every
+        self.update_after_step = update_after_step // update_every # only start EMA after this step number, starting at 0

        self.register_buffer('initted', torch.Tensor([False]))
        self.register_buffer('step', torch.tensor([0.]))
@@ -164,6 +189,9 @@ class EMA(nn.Module):
        device = self.initted.device
        self.ema_model.to(device)

+    def copy_params_from_model_to_ema(self):
+        self.ema_model.load_state_dict(self.online_model.state_dict())
+
    def update(self):
        self.step += 1

@@ -171,7 +199,7 @@ class EMA(nn.Module):
            return

        if not self.initted:
-            self.ema_model.state_dict(self.online_model.state_dict())
+            self.copy_params_from_model_to_ema()
            self.initted.data.copy_(torch.Tensor([True]))

        self.update_moving_average(self.ema_model, self.online_model)
@@ -253,18 +281,21 @@ class DiffusionPriorTrainer(nn.Module):

        self.step += 1

-    @torch.inference_mode()
+    @torch.no_grad()
+    @cast_torch_tensor
    def p_sample_loop(self, *args, **kwargs):
        return self.ema_diffusion_prior.ema_model.p_sample_loop(*args, **kwargs)

-    @torch.inference_mode()
+    @torch.no_grad()
+    @cast_torch_tensor
    def sample(self, *args, **kwargs):
        return self.ema_diffusion_prior.ema_model.sample(*args, **kwargs)

-    @torch.inference_mode()
+    @torch.no_grad()
    def sample_batch_size(self, *args, **kwargs):
        return self.ema_diffusion_prior.ema_model.sample_batch_size(*args, **kwargs)

+    @cast_torch_tensor
    def forward(
        self,
        *args,
@@ -279,7 +310,9 @@ class DiffusionPriorTrainer(nn.Module):
                loss = loss * chunk_size_frac

            total_loss += loss.item()
-            self.scaler.scale(loss).backward()
+
+            if self.training:
+                self.scaler.scale(loss).backward()

        return total_loss

@@ -305,11 +338,6 @@ class DecoderTrainer(nn.Module):
        self.num_unets = len(self.decoder.unets)

        self.use_ema = use_ema
-
-        if use_ema:
-            has_lazy_linear = any([type(module) == nn.LazyLinear for module in decoder.modules()])
-            assert not has_lazy_linear, 'you must set the text_embed_dim on your u-nets if you plan on doing automatic exponential moving average'
-
        self.ema_unets = nn.ModuleList([])

        self.amp = amp
@@ -352,8 +380,11 @@ class DecoderTrainer(nn.Module):
        scaler = getattr(self, f'scaler{index}')
        return scaler.scale(loss)

-    def update(self, unet_number):
-        assert 1 <= unet_number <= self.num_unets
+    def update(self, unet_number = None):
+        if self.num_unets == 1:
+            unet_number = default(unet_number, 1)
+
+        assert exists(unet_number) and 1 <= unet_number <= self.num_unets
        index = unet_number - 1
        unet = self.decoder.unets[index]

@@ -375,7 +406,11 @@ class DecoderTrainer(nn.Module):
        self.step += 1

    @torch.no_grad()
+    @cast_torch_tensor
    def sample(self, *args, **kwargs):
+        if kwargs.pop('use_non_ema', False):
+            return self.decoder.sample(*args, **kwargs)
+
        if self.use_ema:
            trainable_unets = self.decoder.unets
            self.decoder.unets = self.unets                  # swap in exponential moving averaged unets for sampling
@@ -391,13 +426,17 @@ class DecoderTrainer(nn.Module):

        return output

+    @cast_torch_tensor
    def forward(
        self,
        *args,
-        unet_number,
+        unet_number = None,
        max_batch_size = None,
        **kwargs
    ):
+        if self.num_unets == 1:
+            unet_number = default(unet_number, 1)
+
        total_loss = 0.

        for chunk_size_frac, (chunked_args, chunked_kwargs) in split_args_and_kwargs(*args, split_size = max_batch_size, **kwargs):
@@ -406,6 +445,8 @@ class DecoderTrainer(nn.Module):
                loss = loss * chunk_size_frac

            total_loss += loss.item()
-            self.scale(loss, unet_number = unet_number).backward()
+
+            if self.training:
+                self.scale(loss, unet_number = unet_number).backward()

        return total_loss
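The `cast_torch_tensor` decorator in this file works by flattening positional and keyword values into one tuple, mapping a conversion over that tuple, and then splitting it back into `args` and `kwargs`. A torch-free sketch of the same flatten, map, split pattern may make the index bookkeeping easier to follow. `convert_args` and `list_to_tuple` are hypothetical names introduced only for illustration, and the list-to-tuple conversion stands in for the numpy-to-tensor cast.

```python
from functools import wraps


def convert_args(convert):
    """Hypothetical decorator factory: apply `convert` to every positional and keyword value."""
    def decorator(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            keys = list(kwargs.keys())
            all_values = (*args, *kwargs.values())

            # keyword values sit at the end, so this index marks where they begin
            split_index = len(all_values) - len(keys)

            # one pass over positional and keyword values alike
            all_values = tuple(convert(v) for v in all_values)

            # split back into positional values and keyword values
            new_args = all_values[:split_index]
            new_kwargs = dict(zip(keys, all_values[split_index:]))
            return fn(*new_args, **new_kwargs)
        return inner
    return decorator


def list_to_tuple(value):
    # stand-in conversion: lists become tuples, everything else passes through
    return tuple(value) if isinstance(value, list) else value


@convert_args(list_to_tuple)
def describe(a, b = None, c = None):
    return a, b, c


result = describe([1, 2], c = [3])
```

Because Python dictionaries preserve insertion order, zipping the saved keys with the tail of the converted tuple restores each keyword to its own value.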
--- a/dalle2_pytorch/vqgan_vae_trainer.py
+++ b/dalle2_pytorch/vqgan_vae_trainer.py
--- a/setup.py
+++ b/setup.py
@@ -10,7 +10,7 @@ setup(
      'dream = dalle2_pytorch.cli:dream'
    ],
  },
-  version = '0.2.31',
+  version = '0.2.43',
  license='MIT',
  description = 'DALL-E 2',
  author = 'Phil Wang',
@@ -30,6 +30,7 @@ setup(
    'einops-exts>=0.0.3',
    'embedding-reader',
    'kornia>=0.5.4',
+    'numpy',
    'pillow',
    'resize-right>=0.0.2',
    'rotary-embedding-torch',
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -5,10 +5,13 @@ import time
 import numpy as np

 import torch
+import clip
 from torch import nn

-from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork
-from dalle2_pytorch.train import DiffusionPriorTrainer, load_diffusion_model, save_diffusion_model, print_ribbon
+from dalle2_pytorch.dataloaders import make_splits
+from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork, OpenAIClipAdapter
+from dalle2_pytorch.trainer import DiffusionPriorTrainer, load_diffusion_model, save_diffusion_model, print_ribbon
+
 from dalle2_pytorch.trackers import ConsoleTracker, WandbTracker

 from embedding_reader import EmbeddingReader
@@ -17,8 +20,7 @@ from tqdm import tqdm

 # constants

-NUM_TEST_EMBEDDINGS = 100 # for cosine similarity reporting during training
-REPORT_METRICS_EVERY = 100 # for cosine similarity and other metric reporting during training
+REPORT_METRICS_EVERY = 250 # for cosine similarity and other metric reporting during training

 tracker = WandbTracker()

@@ -36,112 +38,216 @@ class Timer:

    def elapsed(self):
        return time.time() - self.last_time
+
 # functions

-def eval_model(model,device,image_reader,text_reader,start,end,batch_size,loss_type,phase="Validation"):
+def eval_model(model, dataloader, text_conditioned, loss_type, phase="Validation"):
    model.eval()
+
    with torch.no_grad():
        total_loss = 0.
        total_samples = 0.

-        for emb_images, emb_text in zip(image_reader(batch_size=batch_size, start=start, end=end),
-                text_reader(batch_size=batch_size, start=start, end=end)):
+        for image_embeddings, text_data in tqdm(dataloader):

-            emb_images_tensor = torch.tensor(emb_images[0]).to(device)
-            emb_text_tensor = torch.tensor(emb_text[0]).to(device)
+            batches = image_embeddings.shape[0]

-            batches = emb_images_tensor.shape[0]
+            input_args = dict(image_embed=image_embeddings)
+            if text_conditioned:
+                input_args = dict(**input_args, text = text_data)
+            else:
+                input_args = dict(**input_args, text_embed=text_data)

-            loss = model(text_embed = emb_text_tensor, image_embed = emb_images_tensor)
+            loss = model(**input_args)

-            total_loss += loss.item() * batches
+            total_loss += loss * batches
            total_samples += batches

        avg_loss = (total_loss / total_samples)
+
        tracker.log({f'{phase} {loss_type}': avg_loss})

-def report_cosine_sims(diffusion_prior,image_reader,text_reader,train_set_size,NUM_TEST_EMBEDDINGS,device):
+def report_cosine_sims(diffusion_prior, dataloader, text_conditioned):
    diffusion_prior.eval()

    cos = nn.CosineSimilarity(dim=1, eps=1e-6)

-    tstart = train_set_size
-    tend = train_set_size+NUM_TEST_EMBEDDINGS
+    for test_image_embeddings, text_data in tqdm(dataloader):
+
+        # we are text conditioned, we produce an embedding from the tokenized text
+        if text_conditioned:
+            text_embedding, text_encodings, text_mask = diffusion_prior.clip.embed_text(
+                text_data)
+            text_cond = dict(text_embed=text_embedding,
+                             text_encodings=text_encodings, mask=text_mask)
+        else:
+            text_embedding = text_data
+            text_cond = dict(text_embed=text_embedding)
+
+        # make a copy of the text embeddings for shuffling
+        text_embed_shuffled = text_embedding.clone()
+
+        # roll the text to simulate "unrelated" captions
+        rolled_idx = torch.roll(torch.arange(text_embedding.shape[0]), 1)
+        text_embed_shuffled = text_embed_shuffled[rolled_idx]
+        text_embed_shuffled = text_embed_shuffled / \
+            text_embed_shuffled.norm(dim=1, keepdim=True)
+
+        if text_conditioned:
+            text_encodings_shuffled = text_encodings[rolled_idx]
+            text_mask_shuffled = text_mask[rolled_idx]
+        else:
+            text_encodings_shuffled = None
+            text_mask_shuffled = None
+
+        text_cond_shuffled = dict(text_embed=text_embed_shuffled,
+                                  text_encodings=text_encodings_shuffled, mask=text_mask_shuffled)

-    for embt, embi in zip(text_reader(batch_size=NUM_TEST_EMBEDDINGS, start=tstart, end=tend), 
-            image_reader(batch_size=NUM_TEST_EMBEDDINGS, start=tstart, end=tend)):
-       # make a copy of the text embeddings for shuffling
-       text_embed = torch.tensor(embt[0]).to(device)
-       text_embed_shuffled = text_embed.clone()
-        # roll the text embeddings to simulate "unrelated" captions
-       rolled_idx = torch.roll(torch.arange(NUM_TEST_EMBEDDINGS), 1)
-       text_embed_shuffled = text_embed_shuffled[rolled_idx]
-       text_embed_shuffled = text_embed_shuffled / \
-           text_embed_shuffled.norm(dim=1, keepdim=True)
-       test_text_shuffled_cond = dict(text_embed=text_embed_shuffled)
        # prepare the text embedding
-       text_embed = text_embed / text_embed.norm(dim=1, keepdim=True)
-       test_text_cond = dict(text_embed=text_embed)
+        text_embed = text_embedding / text_embedding.norm(dim=1, keepdim=True)
+
        # prepare image embeddings
-       test_image_embeddings = torch.tensor(embi[0]).to(device)
-       test_image_embeddings = test_image_embeddings / \
-           test_image_embeddings.norm(dim=1, keepdim=True)
+        test_image_embeddings = test_image_embeddings / \
+            test_image_embeddings.norm(dim=1, keepdim=True)
+
        # predict on the unshuffled text embeddings
-       predicted_image_embeddings = diffusion_prior.p_sample_loop(
-           (NUM_TEST_EMBEDDINGS, 768), text_cond=test_text_cond)
-       predicted_image_embeddings = predicted_image_embeddings / \
-           predicted_image_embeddings.norm(dim=1, keepdim=True)
+        predicted_image_embeddings = diffusion_prior.p_sample_loop(
+            test_image_embeddings.shape, text_cond)
+        predicted_image_embeddings = predicted_image_embeddings / \
+            predicted_image_embeddings.norm(dim=1, keepdim=True)
+
        # predict on the shuffled embeddings
-       predicted_unrelated_embeddings = diffusion_prior.p_sample_loop(
-           (NUM_TEST_EMBEDDINGS, 768), text_cond=test_text_shuffled_cond)
-       predicted_unrelated_embeddings = predicted_unrelated_embeddings / \
-           predicted_unrelated_embeddings.norm(dim=1, keepdim=True)
+        predicted_unrelated_embeddings = diffusion_prior.p_sample_loop(
+            test_image_embeddings.shape, text_cond_shuffled)
+        predicted_unrelated_embeddings = predicted_unrelated_embeddings / \
+            predicted_unrelated_embeddings.norm(dim=1, keepdim=True)
+
        # calculate similarities
-       original_similarity = cos(
+        original_similarity = cos(
           text_embed, test_image_embeddings).cpu().numpy()
-       predicted_similarity = cos(
+        predicted_similarity = cos(
           text_embed, predicted_image_embeddings).cpu().numpy()
-       unrelated_similarity = cos(
+        unrelated_similarity = cos(
           text_embed, predicted_unrelated_embeddings).cpu().numpy()
-       predicted_img_similarity = cos(
+        predicted_img_similarity = cos(
           test_image_embeddings, predicted_image_embeddings).cpu().numpy()
-       tracker.log({"CosineSimilarity(text_embed,image_embed)": np.mean(original_similarity),
+        tracker.log({"CosineSimilarity(text_embed,image_embed)": np.mean(original_similarity),
            "CosineSimilarity(text_embed,predicted_image_embed)":np.mean(predicted_similarity),
            "CosineSimilarity(orig_image_embed,predicted_image_embed)":np.mean(predicted_img_similarity),
            "CosineSimilarity(text_embed,predicted_unrelated_embed)": np.mean(unrelated_similarity),
            "Cosine similarity difference":np.mean(predicted_similarity - original_similarity)})

-def train(image_embed_dim,
-          image_embed_url,
-          text_embed_url,
-          batch_size,
-          train_percent,
-          val_percent,
-          test_percent,
-          num_epochs,
-          dp_loss_type,
-          clip,
-          dp_condition_on_text_encodings,
-          dp_timesteps,
-          dp_normformer,
-          dp_cond_drop_prob,
-          dpn_depth,
-          dpn_dim_head,
-          dpn_heads,
-          save_interval,
-          save_path,
-          device,
-          RESUME,
-          DPRIOR_PATH,
-          config,
-          wandb_entity,
-          wandb_project,
-          learning_rate=0.001,
-          max_grad_norm=0.5,
-          weight_decay=0.01,
-          dropout=0.05,
-          amp=False):

+@click.command()
+@click.option("--wandb-entity", default="laion")
+@click.option("--wandb-project", default="diffusion-prior")
+@click.option("--wandb-dataset", default="LAION-5B")
+@click.option("--wandb-arch", default="DiffusionPrior")
+@click.option("--image-embed-url", default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/")
+@click.option("--text-embed-url", default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/")
+@click.option("--meta-url", default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/laion2B-en-metadata/")
+@click.option("--learning-rate", default=1.1e-4)
+@click.option("--weight-decay", default=6.02e-2)
+@click.option("--dropout", default=5e-2)
+@click.option("--max-grad-norm", default=0.5)
+@click.option("--num-data-points", default=250e6)
+@click.option("--batch-size", default=320)
+@click.option("--num-epochs", default=5)
+@click.option("--image-embed-dim", default=768)
+@click.option("--train-percent", default=0.9)
+@click.option("--val-percent", default=1e-7)
+@click.option("--test-percent", default=0.0999999)
+@click.option("--dpn-depth", default=12)
+@click.option("--dpn-dim-head", default=64)
+@click.option("--dpn-heads", default=12)
+@click.option("--dp-condition-on-text-encodings", default=True)
+@click.option("--dp-timesteps", default=1000)
+@click.option("--dp-normformer", default=True)
+@click.option("--dp-cond-drop-prob", default=0.1)
+@click.option("--dp-loss-type", default="l2")
+@click.option("--clip", default="ViT-L/14")
+@click.option("--amp", default=False)
+@click.option("--save-interval", default=120)
+@click.option("--save-path", default="./diffusion_prior_checkpoints")
+@click.option("--pretrained-model-path", default=None)
+@click.option("--gpu-device", default=0)
+def train(
+    wandb_entity,
+    wandb_project,
+    wandb_dataset,
+    wandb_arch,
+    image_embed_url,
+    text_embed_url,
+    meta_url,
+    learning_rate,
+    weight_decay,
+    dropout,
+    max_grad_norm,
+    num_data_points,
+    batch_size,
+    num_epochs,
+    image_embed_dim,
+    train_percent,
+    val_percent,
+    test_percent,
+    dpn_depth,
+    dpn_dim_head,
+    dpn_heads,
+    dp_condition_on_text_encodings,
+    dp_timesteps,
+    dp_normformer,
+    dp_cond_drop_prob,
+    dp_loss_type,
+    clip,
+    amp,
+    save_interval,
+    save_path,
+    pretrained_model_path,
+    gpu_device
+):
+    config = {
+        "learning_rate": learning_rate,
+        "architecture": wandb_arch,
+        "dataset": wandb_dataset,
+        "weight_decay": weight_decay,
+        "max_gradient_clipping_norm": max_grad_norm,
+        "batch_size": batch_size,
+        "epochs": num_epochs,
+        "diffusion_prior_network": {
+            "depth": dpn_depth,
+            "dim_head": dpn_dim_head,
+            "heads": dpn_heads,
+            "normformer": dp_normformer
+        },
+        "diffusion_prior": {
+            "condition_on_text_encodings": dp_condition_on_text_encodings,
+            "timesteps": dp_timesteps,
+            "cond_drop_prob": dp_cond_drop_prob,
+            "loss_type": dp_loss_type,
+            "clip": clip
+        }
+    }
+
+    # Check whether a saved model path (DPRIOR_PATH) was provided
+
+    DPRIOR_PATH = pretrained_model_path
+    RESUME = exists(DPRIOR_PATH)
+
+    if not RESUME:
+        tracker.init(
+            entity = wandb_entity,
+            project = wandb_project,
+            config = config
+        )
+
+    # Obtain the utilized device.
+
+    has_cuda = torch.cuda.is_available()
+    if has_cuda:
+        device = torch.device(f"cuda:{gpu_device}")
+        torch.cuda.set_device(device)
+
+    # Training loop
    # diffusion prior network

    prior_network = DiffusionPriorNetwork( 
@@ -154,11 +260,17 @@ def train(image_embed_dim,
        normformer = dp_normformer
    )
    
+    # Load clip model if text-conditioning
+    if dp_condition_on_text_encodings:
+        clip_adapter = OpenAIClipAdapter(clip)
+    else:
+        clip_adapter = None
+        
    # diffusion prior with text embeddings and image embeddings pre-computed

    diffusion_prior = DiffusionPrior( 
        net = prior_network,
-        clip = clip,
+        clip = clip_adapter,
        image_embed_dim = image_embed_dim,
        timesteps = dp_timesteps,
        cond_drop_prob = dp_cond_drop_prob,
@@ -192,33 +304,37 @@ def train(image_embed_dim,

    Path(save_path).mkdir(exist_ok = True, parents = True)

-    # Get image and text embeddings from the servers
+    # Utilize wrapper to abstract away loader logic
+    print_ribbon("Downloading Embeddings")
+    loader_args = dict(text_conditioned=dp_condition_on_text_encodings, batch_size=batch_size, num_data_points=num_data_points,
+                       train_split=train_percent, eval_split=val_percent, device=device, img_url=image_embed_url)

-    print_ribbon("Downloading embeddings - image and text")
-    image_reader = EmbeddingReader(embeddings_folder=image_embed_url, file_format="npy")
-    text_reader  = EmbeddingReader(embeddings_folder=text_embed_url, file_format="npy")
-    num_data_points = text_reader.count
+    if dp_condition_on_text_encodings:
+        loader_args = dict(**loader_args, meta_url=meta_url)
+    else:
+        loader_args = dict(**loader_args, txt_url=text_embed_url)
+
+    train_loader, eval_loader, test_loader = make_splits(**loader_args)

    ### Training code ###

+    step = 1 
    timer = Timer()
    epochs = num_epochs

-    train_set_size = int(train_percent*num_data_points)
-    val_set_size = int(val_percent*num_data_points)
-    eval_start = train_set_size
-
    for _ in range(epochs):

-        for emb_images,emb_text in zip(image_reader(batch_size=batch_size, start=0, end=train_set_size),
-                text_reader(batch_size=batch_size, start=0, end=train_set_size)):
-
-            trainer.train()
+        for image, text in tqdm(train_loader):
            
-            emb_images_tensor = torch.tensor(emb_images[0]).to(device)
-            emb_text_tensor = torch.tensor(emb_text[0]).to(device)
+            diffusion_prior.train()
+            
+            input_args = dict(image_embed=image)
+            if dp_condition_on_text_encodings:
+                input_args = dict(**input_args, text = text)
+            else:
+                input_args = dict(**input_args, text_embed=text)

-            loss = trainer(text_embed = emb_text_tensor, image_embed = emb_images_tensor)
+            loss = trainer(**input_args)

            # Samples per second

@@ -237,172 +353,23 @@ def train(image_embed_dim,
                    image_embed_dim)

            # Log to wandb
-            tracker.log({"Training loss": loss.item(),
+            tracker.log({"Training loss": loss,
                        "Steps": step,
                        "Samples per second": samples_per_sec})
            # Log cosineSim(text_embed,predicted_image_embed) - cosineSim(text_embed,image_embed)
            # Use NUM_TEST_EMBEDDINGS samples from the test set each time
            # Get embeddings from the most recently saved model
            if(step % REPORT_METRICS_EVERY) == 0:
-                report_cosine_sims(diffusion_prior,
-                        image_reader,
-                        text_reader,
-                        train_set_size,
-                        NUM_TEST_EMBEDDINGS,
-                        device)
+                report_cosine_sims(diffusion_prior, eval_loader, dp_condition_on_text_encodings)
                ### Evaluate model(validation run) ###
-                eval_model(diffusion_prior,
-                        device,
-                        image_reader,
-                        text_reader,
-                        eval_start,
-                        eval_start+NUM_TEST_EMBEDDINGS,
-                        NUM_TEST_EMBEDDINGS,
-                        dp_loss_type,
-                        phase="Validation")
+                eval_model(diffusion_prior, eval_loader, dp_condition_on_text_encodings, dp_loss_type, phase="Validation")

+            step += 1
            trainer.update()

    ### Test run ###
-    test_set_size = int(test_percent*train_set_size) 
-    start = train_set_size+val_set_size
-    end = num_data_points
-    eval_model(diffusion_prior,device,image_reader,text_reader,start,end,batch_size,dp_loss_type,phase="Test")
+    eval_model(diffusion_prior, test_loader, dp_condition_on_text_encodings, dp_loss_type, phase="Test")

-@click.command()
-@click.option("--wandb-entity", default="laion")
-@click.option("--wandb-project", default="diffusion-prior")
-@click.option("--wandb-dataset", default="LAION-5B")
-@click.option("--wandb-arch", default="DiffusionPrior")
-@click.option("--image-embed-url", default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/")
-@click.option("--text-embed-url", default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/")
-@click.option("--learning-rate", default=1.1e-4)
-@click.option("--weight-decay", default=6.02e-2)
-@click.option("--dropout", default=5e-2)
-@click.option("--max-grad-norm", default=0.5)
-@click.option("--batch-size", default=10**4)
-@click.option("--num-epochs", default=5)
-@click.option("--image-embed-dim", default=768)
-@click.option("--train-percent", default=0.7)
-@click.option("--val-percent", default=0.2)
-@click.option("--test-percent", default=0.1)
-@click.option("--dpn-depth", default=6)
-@click.option("--dpn-dim-head", default=64)
-@click.option("--dpn-heads", default=8)
-@click.option("--dp-condition-on-text-encodings", default=False)
-@click.option("--dp-timesteps", default=100)
-@click.option("--dp-normformer", default=False)
-@click.option("--dp-cond-drop-prob", default=0.1)
-@click.option("--dp-loss-type", default="l2")
-@click.option("--clip", default=None)
-@click.option("--amp", default=False)
-@click.option("--save-interval", default=30)
-@click.option("--save-path", default="./diffusion_prior_checkpoints")
-@click.option("--pretrained-model-path", default=None)
-def main(
-    wandb_entity,
-    wandb_project,
-    wandb_dataset,
-    wandb_arch,
-    image_embed_url,
-    text_embed_url,
-    learning_rate,
-    weight_decay,
-    dropout,
-    max_grad_norm,
-    batch_size,
-    num_epochs,
-    image_embed_dim,
-    train_percent,
-    val_percent,
-    test_percent,
-    dpn_depth,
-    dpn_dim_head,
-    dpn_heads,
-    dp_condition_on_text_encodings,
-    dp_timesteps,
-    dp_normformer,
-    dp_cond_drop_prob,
-    dp_loss_type,
-    clip,
-    amp,
-    save_interval,
-    save_path,
-    pretrained_model_path
-):
-    config = {
-        "learning_rate": learning_rate,
-        "architecture": wandb_arch,
-        "dataset": wandb_dataset,
-        "weight_decay": weight_decay,
-        "max_gradient_clipping_norm": max_grad_norm,
-        "batch_size": batch_size,
-        "epochs": num_epochs,
-        "diffusion_prior_network": {
-            "depth": dpn_depth,
-            "dim_head": dpn_dim_head,
-            "heads": dpn_heads,
-            "normformer": dp_normformer
-        },
-        "diffusion_prior": {
-            "condition_on_text_encodings": dp_condition_on_text_encodings,
-            "timesteps": dp_timesteps,
-            "cond_drop_prob": dp_cond_drop_prob,
-            "loss_type": dp_loss_type,
-            "clip": clip
-        }
-    }
-
-    # Check if DPRIOR_PATH exists(saved model path)
-
-    DPRIOR_PATH = args.pretrained_model_path
-    RESUME = exists(DPRIOR_PATH)
-
-    if not RESUME:
-        tracker.init(
-            entity = wandb_entity,
-            project = wandb_project,
-            config = config
-        )
-
-    # Obtain the utilized device.
-
-    has_cuda = torch.cuda.is_available()
-    if has_cuda:
-        device = torch.device("cuda:0")
-        torch.cuda.set_device(device)
-
-    # Training loop
-    train(image_embed_dim,
-          image_embed_url,
-          text_embed_url,
-          batch_size,
-          train_percent,
-          val_percent,
-          test_percent,
-          num_epochs,
-          dp_loss_type,
-          clip,
-          dp_condition_on_text_encodings,
-          dp_timesteps,
-          dp_normformer,
-          dp_cond_drop_prob,
-          dpn_depth,
-          dpn_dim_head,
-          dpn_heads,
-          save_interval,
-          save_path,
-          device,
-          RESUME,
-          DPRIOR_PATH,
-          config,
-          wandb_entity,
-          wandb_project,
-          learning_rate,
-          max_grad_norm,
-          weight_decay,
-          dropout,
-          amp)

 if __name__ == "__main__":
-    main()
+    train()
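In `report_cosine_sims`, "unrelated" captions are simulated by rolling the batch indices by one position, so that each image is paired with the caption of its neighbour. The sketch below reproduces that index ordering in plain Python, without torch. `rolled_indices` is a hypothetical helper, and the caption strings are made-up placeholders.

```python
def rolled_indices(n):
    # same ordering as torch.roll(torch.arange(n), 1): [n - 1, 0, 1, ..., n - 2]
    return [(i - 1) % n for i in range(n)]


captions = ['a cat', 'a dog', 'a boat', 'a tree']

idx = rolled_indices(len(captions))
shuffled = [captions[i] for i in idx]

# every position now holds a caption that originally belonged to another sample
mismatched = all(s != c for s, c in zip(shuffled, captions))
```

Note that with a batch of size 1 the roll is a no-op, so the "unrelated" caption would be the original one; the comparison is only meaningful for batches of two or more distinct captions.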
Author	SHA1	Message	Date
Phil Wang	f4016f6302	allow for overriding use of EMA during sampling in decoder trainer with use_non_ema keyword, also fix some issues with automatic normalization of images and low res conditioning image if latent diffusion is in play	2022-05-16 11:18:30 -07:00
Phil Wang	1212f7058d	allow text encodings and text mask to be passed in on forward and sampling for Decoder class	2022-05-16 10:40:32 -07:00
Phil Wang	dab106d4e5	back to no_grad for now, also keep track and restore unet devices in one_unet_in_gpu contextmanager	2022-05-16 09:36:14 -07:00
Phil Wang	bb151ca6b1	unet_number on decoder trainer only needs to be passed in if there is greater than 1 unet, so that unconditional training of a single ddpm is seamless (experiment in progress locally)	2022-05-16 09:17:17 -07:00
zion	4a59dea4cf	Migrate to text-conditioned prior training (#95) * migrate to conditioned prior * unify reader logic with a wrapper (#1) * separate out reader logic * support both training methods * Update train prior to use embedding wrapper (#3) * Support Both Methods * bug fixes * small bug fixes * embedding only wrapper bug * use smaller val perc * final bug fix for embedding-only Co-authored-by: nousr <>	2022-05-15 20:16:38 -07:00
Phil Wang	ecf9e8027d	make sure classifier free guidance is used only if conditional dropout is present on the DiffusionPrior and Decoder classes. also make sure prior can have a different conditional scale than decoder	2022-05-15 19:09:38 -07:00
Phil Wang	36c5079bd7	LazyLinear is not mature, make users pass in text_embed_dim if text conditioning is turned on	2022-05-15 18:56:52 -07:00
Phil Wang	4a4c7ac9e6	cond drop prob for diffusion prior network should default to 0	2022-05-15 18:47:45 -07:00
Phil Wang	fad7481479	todo	2022-05-15 17:00:25 -07:00
Phil Wang	123658d082	cite Ho et al, since cascading ddpm is now trainable	2022-05-15 16:56:53 -07:00
Phil Wang	11d4e11f10	allow for training unconditional ddpm or cascading ddpms	2022-05-15 16:54:56 -07:00
Phil Wang	99778e12de	trainer classes now takes care of auto-casting numpy to torch tensors, and setting correct device based on model parameter devices	2022-05-15 15:25:45 -07:00
Phil Wang	0f0011caf0	todo	2022-05-15 14:28:35 -07:00
Phil Wang	7b7a62044a	use eval vs training mode to determine whether to call backprop on trainer forward	2022-05-15 14:20:59 -07:00
Phil Wang	156fe5ed9f	final cleanup for the day	2022-05-15 12:38:41 -07:00
Phil Wang	5ec34bebe1	cleanup readme	2022-05-15 12:29:26 -07:00
Phil Wang	8eaacf1ac1	remove indirection	2022-05-15 12:05:45 -07:00
Phil Wang	e66c7b0249	incorrect naming	2022-05-15 11:23:52 -07:00
Phil Wang	f7cd4a0992	product management	2022-05-15 11:21:12 -07:00
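Commit bb151ca6b1 makes `unet_number` optional on the decoder trainer when there is only one unet. A minimal sketch of that resolution logic, mirroring the check in `DecoderTrainer.update`, is shown here. `resolve_unet_index` is a hypothetical standalone function, and `exists` and `default` are simplified versions of the helpers of the same name in `trainer.py`.

```python
def exists(val):
    return val is not None


def default(val, d):
    return val if exists(val) else d


def resolve_unet_index(unet_number, num_unets):
    """Hypothetical helper: turn a 1-based, possibly omitted unet_number into a 0-based index."""
    # with a single unet there is nothing to choose, so fall back to unet 1
    if num_unets == 1:
        unet_number = default(unet_number, 1)

    # with several unets the caller still has to say which one is meant
    assert exists(unet_number) and 1 <= unet_number <= num_unets
    return unet_number - 1


single = resolve_unet_index(None, num_unets = 1)
second = resolve_unet_index(2, num_unets = 2)
```

Calling `resolve_unet_index(None, num_unets = 2)` raises an `AssertionError`, which keeps cascading setups explicit while letting single-unet training omit the argument.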