fix wandb logging in tracker, and do some cleanup

Implemented the wandb tracker (#106)
Added a base_path parameter to all trackers for storing any local information they need
2026-02-12 11:34:29 +01:00 · 2022-05-20 17:10:33 -07:00 · 2022-05-20 16:39:23 -07:00 · 2022-05-20 16:38:55 -07:00 · 2022-05-18 20:22:52 -07:00 · 2022-05-16 17:38:30 -07:00
16 changed files with 720 additions and 184 deletions
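The `base_path` change described in the commit message can be pictured with a minimal sketch. The class and method names below (`LocalTracker`, `log`) and the `metrics.jsonl` file are hypothetical and are not taken from the repository. The only detail drawn from this commit is the default directory name `.tracker-data`, which is added to `.gitignore` below.

```python
import json
import os
import tempfile


class LocalTracker:
    """Hypothetical tracker that keeps its local files under base_path."""

    def __init__(self, base_path=".tracker-data"):
        self.base_path = base_path
        # Create the directory up front so later writes do not fail on a missing folder.
        os.makedirs(self.base_path, exist_ok=True)
        self.log_file = os.path.join(self.base_path, "metrics.jsonl")

    def log(self, metrics, step):
        # Append one JSON record per call.
        with open(self.log_file, "a") as f:
            f.write(json.dumps({"step": step, **metrics}) + "\n")


# Usage: point the tracker at a scratch directory instead of the default.
scratch = tempfile.mkdtemp()
tracker = LocalTracker(base_path=os.path.join(scratch, "run-1"))
tracker.log({"loss": 0.5}, step=1)
```

Keeping every local artifact under one configurable directory means a single ignore rule covers whatever the trackers write.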
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,6 @@
+# default experiment tracker data
+.tracker-data/
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
--- a/README.md
+++ b/README.md
@@ -14,6 +14,16 @@ Please join <a href="https://discord.gg/xBPBXfcFHd"><img alt="Join us on Discord

 There was enough interest for a <a href="https://github.com/lucidrains/dalle2-jax">Jax version</a>. I will also eventually extend this to <a href="https://github.com/lucidrains/dalle2-video">text to video</a>, once the repository is in a good place.

+## Status
+
+- A research group has used the code in this repository to train a functional diffusion prior for their CLIP generations. Will share their work once they release their preprint. This, and <a href="https://github.com/crowsonkb">Katherine's</a> own experiments, validate OpenAI's finding that the extra prior increases variety of generations.
+
+- Decoder is now verified working for unconditional generation on my experimental setup for Oxford flowers. 2 researchers have also confirmed Decoder is working for them.
+
+<img src="./samples/oxford.png" width="600px" />
+
+*ongoing at 21k steps*
+
 ## Install

 ```bash
@@ -814,8 +824,8 @@ clip = CLIP(

 # mock data

-text = torch.randint(0, 49408, (32, 256)).cuda()
-images = torch.randn(32, 3, 256, 256).cuda()
+text = torch.randint(0, 49408, (512, 256)).cuda()
+images = torch.randn(512, 3, 256, 256).cuda()

 # prior networks (with transformer)

@@ -848,7 +858,7 @@ diffusion_prior_trainer.update()  # this will update the optimizer as well as th
 # after much of the above three lines in a loop
 # you can sample from the exponential moving average of the diffusion prior identically to how you do so for DiffusionPrior

-image_embeds = diffusion_prior_trainer.sample(text) # (4, 512) - exponential moving averaged image embeddings
+image_embeds = diffusion_prior_trainer.sample(text, max_batch_size = 4) # (512, 512) - exponential moving averaged image embeddings
 ```

 ## Bonus
@@ -861,7 +871,7 @@ ex.

 ```python
 import torch
-from dalle2_pytorch import Unet, Decoder
+from dalle2_pytorch import Unet, Decoder, DecoderTrainer

 # unet for the cascading ddpm

@@ -884,20 +894,24 @@ decoder = Decoder(
    unconditional = True
 ).cuda()

-# mock images (get a lot of this)
+# decoder trainer
+
+decoder_trainer = DecoderTrainer(decoder)
+
+# images (get a lot of this)

 images = torch.randn(1, 3, 512, 512).cuda()

 # feed images into decoder

 for i in (1, 2):
-    loss = decoder(images, unet_number = i)
-    loss.backward()
+    loss = decoder_trainer(images, unet_number = i)
+    decoder_trainer.update(unet_number = i)

-# do the above for many many many many steps
+# do the above for many many many many images
 # then it will learn to generate images

-images = decoder.sample(batch_size = 2) # (2, 3, 512, 512)
+images = decoder_trainer.sample(batch_size = 36, max_batch_size = 4) # (36, 3, 512, 512)
 ```

 ## Dataloaders
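The README change above passes `max_batch_size = 4` while sampling for 512 prompts. The trainer code that implements this is not part of the lines shown here, so the following is only a plain-Python sketch of the general idea, assuming the call is split into chunks of at most `max_batch_size` and the per-chunk outputs are concatenated. `split_into_chunks`, `sample_in_chunks`, and `fake_sample` are illustrative names, not functions from the repository.

```python
def split_into_chunks(items, max_batch_size):
    """Yield consecutive slices of at most max_batch_size items."""
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


def sample_in_chunks(sample_fn, inputs, max_batch_size):
    """Run sample_fn on each chunk and concatenate the per-chunk outputs."""
    outputs = []
    for chunk in split_into_chunks(inputs, max_batch_size):
        outputs.extend(sample_fn(chunk))
    return outputs


seen_batch_sizes = []


def fake_sample(batch):
    # Stand-in for a model call; it records how large each batch was.
    seen_batch_sizes.append(len(batch))
    return [x * 2 for x in batch]


result = sample_in_chunks(fake_sample, list(range(10)), max_batch_size=4)
```

The caller still receives one output per input, while the stand-in model never sees more than `max_batch_size` items at once.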
--- a/dalle2_pytorch/__init__.py
+++ b/dalle2_pytorch/__init__.py
@@ -1,6 +1,6 @@
 from dalle2_pytorch.dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder
 from dalle2_pytorch.dalle2_pytorch import OpenAIClipAdapter
-from dalle2_pytorch.train import DecoderTrainer, DiffusionPriorTrainer
+from dalle2_pytorch.trainer import DecoderTrainer, DiffusionPriorTrainer

 from dalle2_pytorch.vqgan_vae import VQGanVAE
 from x_clip import CLIP
--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
@@ -61,6 +61,9 @@ def default(val, d):
 def cast_tuple(val, length = 1):
    return val if isinstance(val, tuple) else ((val,) * length)

+def module_device(module):
+    return next(module.parameters()).device
+
 @contextmanager
 def null_context(*args, **kwargs):
    yield
@@ -901,6 +904,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
        self.channels = default(image_channels, lambda: clip.image_channels)

        self.cond_drop_prob = cond_drop_prob
+        self.can_classifier_guidance = cond_drop_prob > 0.
        self.condition_on_text_encodings = condition_on_text_encodings

        # in paper, they do not predict the noise, but predict x0 directly for image embedding, claiming empirically better results. I'll just offer both.
@@ -914,8 +918,10 @@ class DiffusionPrior(BaseGaussianDiffusion):
        self.training_clamp_l2norm = training_clamp_l2norm
        self.init_image_embed_l2norm = init_image_embed_l2norm

-    def p_mean_variance(self, x, t, text_cond, clip_denoised: bool):
-        pred = self.net(x, t, **text_cond)
+    def p_mean_variance(self, x, t, text_cond, clip_denoised = False, cond_scale = 1.):
+        assert not (cond_scale != 1. and not self.can_classifier_guidance), 'the model was not trained with conditional dropout, and thus one cannot use classifier free guidance (cond_scale anything other than 1)'
+
+        pred = self.net.forward_with_cond_scale(x, t, cond_scale = cond_scale, **text_cond)

        if self.predict_x_start:
            x_recon = pred
@@ -933,17 +939,17 @@ class DiffusionPrior(BaseGaussianDiffusion):
        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
        return model_mean, posterior_variance, posterior_log_variance

-    @torch.inference_mode()
-    def p_sample(self, x, t, text_cond = None, clip_denoised = True, repeat_noise = False):
+    @torch.no_grad()
+    def p_sample(self, x, t, text_cond = None, clip_denoised = True, repeat_noise = False, cond_scale = 1.):
        b, *_, device = *x.shape, x.device
-        model_mean, _, model_log_variance = self.p_mean_variance(x = x, t = t, text_cond = text_cond, clip_denoised = clip_denoised)
+        model_mean, _, model_log_variance = self.p_mean_variance(x = x, t = t, text_cond = text_cond, clip_denoised = clip_denoised, cond_scale = cond_scale)
        noise = noise_like(x.shape, device, repeat_noise)
        # no noise when t == 0
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise

-    @torch.inference_mode()
-    def p_sample_loop(self, shape, text_cond):
+    @torch.no_grad()
+    def p_sample_loop(self, shape, text_cond, cond_scale = 1.):
        device = self.betas.device

        b = shape[0]
@@ -954,7 +960,7 @@ class DiffusionPrior(BaseGaussianDiffusion):

        for i in tqdm(reversed(range(0, self.num_timesteps)), desc='sampling loop time step', total=self.num_timesteps):
            times = torch.full((b,), i, device = device, dtype = torch.long)
-            image_embed = self.p_sample(image_embed, times, text_cond = text_cond)
+            image_embed = self.p_sample(image_embed, times, text_cond = text_cond, cond_scale = cond_scale)

        return image_embed

@@ -978,21 +984,21 @@ class DiffusionPrior(BaseGaussianDiffusion):
        loss = self.loss_fn(pred, target)
        return loss

-    @torch.inference_mode()
+    @torch.no_grad()
    @eval_decorator
-    def sample_batch_size(self, batch_size, text_cond):
+    def sample_batch_size(self, batch_size, text_cond, cond_scale = 1.):
        device = self.betas.device
        shape = (batch_size, self.image_embed_dim)

        img = torch.randn(shape, device = device)

        for i in tqdm(reversed(range(0, self.num_timesteps)), desc = 'sampling loop time step', total = self.num_timesteps):
-            img = self.p_sample(img, torch.full((batch_size,), i, device = device, dtype = torch.long), text_cond = text_cond)
+            img = self.p_sample(img, torch.full((batch_size,), i, device = device, dtype = torch.long), text_cond = text_cond, cond_scale = cond_scale)
        return img

-    @torch.inference_mode()
+    @torch.no_grad()
    @eval_decorator
-    def sample(self, text, num_samples_per_batch = 2):
+    def sample(self, text, num_samples_per_batch = 2, cond_scale = 1.):
        # in the paper, what they did was
        # sample 2 image embeddings, choose the top 1 similarity, as judged by CLIP
        text = repeat(text, 'b ... -> (b r) ...', r = num_samples_per_batch)
@@ -1007,7 +1013,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
        if self.condition_on_text_encodings:
            text_cond = {**text_cond, 'text_encodings': text_encodings, 'mask': text_mask}

-        image_embeds = self.p_sample_loop((batch_size, image_embed_dim), text_cond = text_cond)
+        image_embeds = self.p_sample_loop((batch_size, image_embed_dim), text_cond = text_cond, cond_scale = cond_scale)

        # retrieve original unscaled image embed

@@ -1691,7 +1697,8 @@ class Decoder(BaseGaussianDiffusion):
        clip_adapter_overrides = dict(),
        learned_variance = True,
        vb_loss_weight = 0.001,
-        unconditional = False
+        unconditional = False,
+        auto_normalize_img = True,                  # whether to take care of normalizing the image from [0, 1] to [-1, 1] and back automatically - you can turn this off if you want to pass in the [-1, 1] ranged image yourself from the dataloader
    ):
        super().__init__(
            beta_schedule = beta_schedule,
@@ -1793,12 +1800,17 @@ class Decoder(BaseGaussianDiffusion):

        self.image_cond_drop_prob = image_cond_drop_prob
        self.text_cond_drop_prob = text_cond_drop_prob
+        self.can_classifier_guidance = image_cond_drop_prob > 0. or text_cond_drop_prob > 0.

        # whether to clip when sampling

        self.clip_denoised = clip_denoised
        self.clip_x_start = clip_x_start

+        # normalize and unnormalize image functions
+        self.normalize_img = normalize_neg_one_to_one if auto_normalize_img else identity
+        self.unnormalize_img = unnormalize_zero_to_one if auto_normalize_img else identity
+
    def get_unet(self, unet_number):
        assert 0 < unet_number <= len(self.unets)
        index = unet_number - 1
@@ -1812,13 +1824,19 @@ class Decoder(BaseGaussianDiffusion):
            unet = self.get_unet(unet_number)

        self.cuda()
-        self.unets.cpu()

+        devices = [module_device(unet) for unet in self.unets]
+        self.unets.cpu()
        unet.cuda()
+
        yield
-        unet.cpu()
+
+        for unet, device in zip(self.unets, devices):
+            unet.to(device)

    def p_mean_variance(self, unet, x, t, image_embed, text_encodings = None, text_mask = None, lowres_cond_img = None, clip_denoised = True, predict_x_start = False, learned_variance = False, cond_scale = 1., model_output = None):
+        assert not (cond_scale != 1. and not self.can_classifier_guidance), 'the decoder was not trained with conditional dropout, and thus one cannot use classifier free guidance (cond_scale anything other than 1)'
+
        pred = default(model_output, lambda: unet.forward_with_cond_scale(x, t, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, cond_scale = cond_scale, lowres_cond_img = lowres_cond_img))

        if learned_variance:
@@ -1847,7 +1865,7 @@ class Decoder(BaseGaussianDiffusion):

        return model_mean, posterior_variance, posterior_log_variance

-    @torch.inference_mode()
+    @torch.no_grad()
    def p_sample(self, unet, x, t, image_embed, text_encodings = None, text_mask = None, cond_scale = 1., lowres_cond_img = None, predict_x_start = False, learned_variance = False, clip_denoised = True, repeat_noise = False):
        b, *_, device = *x.shape, x.device
        model_mean, _, model_log_variance = self.p_mean_variance(unet, x = x, t = t, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, cond_scale = cond_scale, lowres_cond_img = lowres_cond_img, clip_denoised = clip_denoised, predict_x_start = predict_x_start, learned_variance = learned_variance)
@@ -1856,14 +1874,15 @@ class Decoder(BaseGaussianDiffusion):
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise

-    @torch.inference_mode()
-    def p_sample_loop(self, unet, shape, image_embed, predict_x_start = False, learned_variance = False, clip_denoised = True, lowres_cond_img = None, text_encodings = None, text_mask = None, cond_scale = 1):
+    @torch.no_grad()
+    def p_sample_loop(self, unet, shape, image_embed, predict_x_start = False, learned_variance = False, clip_denoised = True, lowres_cond_img = None, text_encodings = None, text_mask = None, cond_scale = 1, is_latent_diffusion = False):
        device = self.betas.device

        b = shape[0]
        img = torch.randn(shape, device = device)

-        lowres_cond_img = maybe(normalize_neg_one_to_one)(lowres_cond_img)
+        if not is_latent_diffusion:
+            lowres_cond_img = maybe(self.normalize_img)(lowres_cond_img)

        for i in tqdm(reversed(range(0, self.num_timesteps)), desc = 'sampling loop time step', total = self.num_timesteps):
            img = self.p_sample(
@@ -1880,16 +1899,17 @@ class Decoder(BaseGaussianDiffusion):
                clip_denoised = clip_denoised
            )

-        unnormalize_img = unnormalize_zero_to_one(img)
+        unnormalize_img = self.unnormalize_img(img)
        return unnormalize_img

-    def p_losses(self, unet, x_start, times, *, image_embed, lowres_cond_img = None, text_encodings = None, text_mask = None, predict_x_start = False, noise = None, learned_variance = False, clip_denoised = False):
+    def p_losses(self, unet, x_start, times, *, image_embed, lowres_cond_img = None, text_encodings = None, text_mask = None, predict_x_start = False, noise = None, learned_variance = False, clip_denoised = False, is_latent_diffusion = False):
        noise = default(noise, lambda: torch.randn_like(x_start))

        # normalize to [-1, 1]

-        x_start = normalize_neg_one_to_one(x_start)
-        lowres_cond_img = maybe(normalize_neg_one_to_one)(lowres_cond_img)
+        if not is_latent_diffusion:
+            x_start = self.normalize_img(x_start)
+            lowres_cond_img = maybe(self.normalize_img)(lowres_cond_img)

        # get x_t

@@ -1949,12 +1969,14 @@ class Decoder(BaseGaussianDiffusion):

        return loss + vb_loss

-    @torch.inference_mode()
+    @torch.no_grad()
    @eval_decorator
    def sample(
        self,
        image_embed = None,
        text = None,
+        text_mask = None,
+        text_encodings = None,
        batch_size = 1,
        cond_scale = 1.,
        stop_at_unet_number = None
@@ -1964,8 +1986,8 @@ class Decoder(BaseGaussianDiffusion):
        if not self.unconditional:
            batch_size = image_embed.shape[0]

-        text_encodings = text_mask = None
-        if exists(text):
+        if exists(text) and not exists(text_encodings) and not self.unconditional:
+            assert exists(self.clip)
            _, text_encodings, text_mask = self.clip.embed_text(text)

        assert not (self.condition_on_text_encodings and not exists(text_encodings)), 'text or text encodings must be passed into decoder if specified'
@@ -2001,7 +2023,8 @@ class Decoder(BaseGaussianDiffusion):
                    predict_x_start = predict_x_start,
                    learned_variance = learned_variance,
                    clip_denoised = not is_latent_diffusion,
-                    lowres_cond_img = lowres_cond_img
+                    lowres_cond_img = lowres_cond_img,
+                    is_latent_diffusion = is_latent_diffusion
                )

                img = vae.decode(img)
@@ -2017,6 +2040,7 @@ class Decoder(BaseGaussianDiffusion):
        text = None,
        image_embed = None,
        text_encodings = None,
+        text_mask = None,
        unet_number = None
    ):
        assert not (len(self.unets) > 1 and not exists(unet_number)), f'you must specify which unet you want trained, from a range of 1 to {len(self.unets)}, if you are training cascading DDPM (multiple unets)'
@@ -2041,7 +2065,6 @@ class Decoder(BaseGaussianDiffusion):
            assert exists(self.clip), 'if you want to derive CLIP image embeddings automatically, you must supply `clip` to the decoder on init'
            image_embed, _ = self.clip.embed_image(image)

-        text_encodings = text_mask = None
        if exists(text) and not exists(text_encodings) and not self.unconditional:
            assert exists(self.clip), 'if you are passing in raw text, you need to supply `clip` to the decoder'
            _, text_encodings, text_mask = self.clip.embed_text(text)
@@ -2060,12 +2083,14 @@ class Decoder(BaseGaussianDiffusion):
            image = aug(image)
            lowres_cond_img = aug(lowres_cond_img, params = aug._params)

+        is_latent_diffusion = not isinstance(vae, NullVQGanVAE)
+
        vae.eval()
        with torch.no_grad():
            image = vae.encode(image)
            lowres_cond_img = maybe(vae.encode)(lowres_cond_img)

-        return self.p_losses(unet, image, times, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, lowres_cond_img = lowres_cond_img, predict_x_start = predict_x_start, learned_variance = learned_variance)
+        return self.p_losses(unet, image, times, image_embed = image_embed, text_encodings = text_encodings, text_mask = text_mask, lowres_cond_img = lowres_cond_img, predict_x_start = predict_x_start, learned_variance = learned_variance, is_latent_diffusion = is_latent_diffusion)

 # main class

@@ -2088,22 +2113,23 @@ class DALLE2(nn.Module):

        self.to_pil = T.ToPILImage()

-    @torch.inference_mode()
+    @torch.no_grad()
    @eval_decorator
    def forward(
        self,
        text,
        cond_scale = 1.,
+        prior_cond_scale = 1.,
        return_pil_images = False
    ):
-        device = next(self.parameters()).device
+        device = module_device(self)
        one_text = isinstance(text, str) or (not is_list_str(text) and text.shape[0] == 1)

        if isinstance(text, str) or is_list_str(text):
            text = [text] if not isinstance(text, (list, tuple)) else text
            text = tokenizer.tokenize(text).to(device)

-        image_embed = self.prior.sample(text, num_samples_per_batch = self.prior_num_samples)
+        image_embed = self.prior.sample(text, num_samples_per_batch = self.prior_num_samples, cond_scale = prior_cond_scale)

        text_cond = text if self.decoder_need_text_cond else None
        images = self.decoder.sample(image_embed, text = text_cond, cond_scale = cond_scale)
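The hunks above thread `cond_scale` down to `forward_with_cond_scale` and reject any value other than 1 unless the model was trained with conditional dropout. The body of `forward_with_cond_scale` is not shown in this diff. The sketch below assumes the usual classifier-free guidance blend, `null + (cond - null) * cond_scale`, and uses plain lists in place of tensors. The function names are illustrative.

```python
def guided_prediction(cond_pred, null_pred, cond_scale=1.0):
    """Blend conditional and unconditional predictions element-wise."""
    return [n + (c - n) * cond_scale for c, n in zip(cond_pred, null_pred)]


def check_guidance(cond_scale, cond_drop_prob):
    # Mirrors the assertion in the diff: guidance needs a model trained with dropout.
    can_classifier_guidance = cond_drop_prob > 0.0
    if cond_scale != 1.0 and not can_classifier_guidance:
        raise ValueError("cond_scale != 1 requires training with conditional dropout")


cond = [1.0, 2.0]
null = [0.0, 1.0]
same = guided_prediction(cond, null, cond_scale=1.0)      # reduces to the conditional prediction
stronger = guided_prediction(cond, null, cond_scale=3.0)  # pushed further away from the null prediction
```

With `cond_scale = 1` the blend returns the conditional prediction unchanged, which is why that value is always allowed.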
--- a/dalle2_pytorch/dataloaders/README.md
+++ b/dalle2_pytorch/dataloaders/README.md
@@ -0,0 +1,41 @@
+## Dataloaders
+In order to make loading data simple and efficient, we include some general dataloaders that can be used to train portions of the network.
+
+### Decoder: Image Embedding Dataset
+When training the decoder (and up samplers if training together) in isolation, you will need to load images and corresponding image embeddings. This dataset can read two similar types of datasets. First, it can read a [webdataset](https://github.com/webdataset/webdataset) that contains `.jpg` and `.npy` files in the `.tar`s that contain the images and associated image embeddings respectively. Alternatively, you can also specify a source for the embeddings outside of the webdataset. In this case, the path to the embeddings should contain `.npy` files with the same shard numbers as the webdataset and there should be a correspondence between the filename of the `.jpg` and the index of the embedding in the `.npy`. So, for example, `0001.tar` from the webdataset with image `00010509.jpg` (the first 4 digits are the shard number and the last 4 are the index) in it should be paralleled by a `img_emb_0001.npy` which contains a NumPy array with the embedding at index 509.
+
+Generating a dataset of this type: 
+1. Use [img2dataset](https://github.com/rom1504/img2dataset) to generate a webdataset.
+2. Use [clip-retrieval](https://github.com/rom1504/clip-retrieval) to convert the images to embeddings.
+3. Use [embedding-dataset-reordering](https://github.com/Veldrovive/embedding-dataset-reordering) to reorder the embeddings into the expected format.
+
+Usage:
+```python
+from dalle2_pytorch.dataloaders import ImageEmbeddingDataset, create_image_embedding_dataloader
+
+# Create a dataloader directly.
+dataloader = create_image_embedding_dataloader(
+    tar_url="/path/or/url/to/webdataset/{0000..9999}.tar", # Uses brace expansion notation. This reads all tars from 0000.tar to 9999.tar
+    embeddings_url="path/or/url/to/embeddings/folder",     # Included if .npy files are not in webdataset. Left out or set to None otherwise
+    num_workers=4,
+    batch_size=32,
+    index_width=4,                                         # Number of digits in the index. For a file named 00010509.jpg, the last 4 digits (0509) are the index
+    shuffle_num=200,                                       # Does a shuffle of the data with a buffer size of 200
+    shuffle_shards=True,                                   # Shuffle the order the shards are read in
+    resample_shards=False,                                 # Sample shards with replacement. If true, an epoch will be infinite unless stopped manually
+)
+for img, emb in dataloader:
+    print(img.shape)  # torch.Size([32, 3, 256, 256])
+    print(emb.shape)  # torch.Size([32, 512])
+    # Train decoder only as shown above
+
+# Or create a dataset without a loader so you can configure it manually
+dataset = ImageEmbeddingDataset(
+    urls="/path/or/url/to/webdataset/{0000..9999}.tar",
+    embedding_folder_url="path/or/url/to/embeddings/folder",
+    index_width=4,
+    shuffle_shards=True,
+    resample=False
+)
+```
+
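The filename convention described in this README can be checked with a few lines of plain Python. This is a sketch under assumptions: `split_key` and `embedding_filename` are hypothetical helpers, and the `img_emb_0001.npy` naming is taken from the README example only. The index is sliced from the end of the key, matching the `key[-index_width:]` slice used in `decoder_loader.py` in this commit.

```python
def split_key(key, index_width):
    """Split a sample key into (shard, index); the index is the last index_width digits."""
    return int(key[:-index_width]), int(key[-index_width:])


def embedding_filename(shard, shard_digits=4):
    # File naming follows the README example `img_emb_0001.npy`.
    return f"img_emb_{shard:0{shard_digits}d}.npy"


shard, index = split_key("00010509", index_width=4)
name = embedding_filename(shard)
```

For the README example `00010509.jpg`, this gives shard 1 and index 509, so the embedding is row 509 of `img_emb_0001.npy`.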
--- a/dalle2_pytorch/dataloaders/__init__.py
+++ b/dalle2_pytorch/dataloaders/__init__.py
@@ -1 +1,2 @@
-from dalle2_pytorch.dataloaders.decoder_loader import ImageEmbeddingDataset, create_image_embedding_dataloader
+from dalle2_pytorch.dataloaders.decoder_loader import ImageEmbeddingDataset, create_image_embedding_dataloader
+from dalle2_pytorch.dataloaders.embedding_wrapper import make_splits
--- a/dalle2_pytorch/dataloaders/decoder_loader.py
+++ b/dalle2_pytorch/dataloaders/decoder_loader.py
@@ -3,6 +3,7 @@ import webdataset as wds
 import torch
 import numpy as np
 import fsspec
+import shutil

 def get_shard(filename):
    """
@@ -20,7 +21,7 @@ def get_example_file(fs, path, file_format):
    """
    return fs.glob(os.path.join(path, f"*.{file_format}"))[0]

-def embedding_inserter(samples, embeddings_url, shard_width, handler=wds.handlers.reraise_exception):
+def embedding_inserter(samples, embeddings_url, index_width, handler=wds.handlers.reraise_exception):
    """Given a datum of {"__key__": str, "__url__": str, ...} adds the corresponding embedding and yields"""
    previous_tar_url = None
    current_embeddings = None
@@ -50,8 +51,12 @@ def embedding_inserter(samples, embeddings_url, shard_width, handler=wds.handler
                previous_tar_url = tar_url
                current_embeddings = load_corresponding_embeds(tar_url)
                
-            embedding_index = int(key[shard_width:])
-            sample["npy"] = current_embeddings[embedding_index]
+            embedding_index = int(key[-index_width:])
+            embedding = current_embeddings[embedding_index]
+            # Check whether the embedding is all zeros. If it is, it is a placeholder for a missing image and is not valid, so we raise and let the handler decide whether to continue
+            if torch.count_nonzero(embedding) == 0:
+                raise RuntimeError(f"Webdataset had a sample, but no embedding was found. ImgShard: {key[:-index_width]} - Index: {key[-index_width:]}")
+            sample["npy"] = embedding
            yield sample
        except Exception as exn:  # From wds implementation
            if handler(exn):
@@ -60,6 +65,28 @@ def embedding_inserter(samples, embeddings_url, shard_width, handler=wds.handler
                break
 insert_embedding = wds.filters.pipelinefilter(embedding_inserter)

+def unassociated_shard_skipper(tarfiles, embeddings_url, handler=wds.handlers.reraise_exception):
+    """Checks whether there is a corresponding embeddings file for the tarfile at { url: [URL] } and skips tarfiles that have none"""
+    embeddings_fs, embeddings_path = fsspec.core.url_to_fs(embeddings_url)
+    embedding_files = embeddings_fs.ls(embeddings_path)
+    get_embedding_shard = lambda embedding_file: int(embedding_file.split("_")[-1].split(".")[0])
+    embedding_shards = set([get_embedding_shard(filename) for filename in embedding_files])  # Sets have O(1) check for member
+
+    get_tar_shard = lambda tar_file: int(tar_file.split("/")[-1].split(".")[0])
+    for tarfile in tarfiles:
+        try:
+            webdataset_shard = get_tar_shard(tarfile["url"])
+            # If this shard has an associated embeddings file, we pass it through. Otherwise we iterate until we do have one
+            if webdataset_shard in embedding_shards:
+                yield tarfile
+        except Exception as exn:  # From wds implementation
+            if handler(exn):
+                continue
+            else:
+                break
+    
+skip_unassociated_shards = wds.filters.pipelinefilter(unassociated_shard_skipper)
+
 def verify_keys(samples, handler=wds.handlers.reraise_exception):
    """
    Requires that both the image and embedding are present in the sample
@@ -86,7 +113,9 @@ class ImageEmbeddingDataset(wds.DataPipeline, wds.compat.FluidInterface):
            self,
            urls,
            embedding_folder_url=None,
-            shard_width=None,
+            index_width=None,
+            img_preproc=None,
+            extra_keys=[],
            handler=wds.handlers.reraise_exception,
            resample=False,
            shuffle_shards=True
@@ -97,13 +126,31 @@ class ImageEmbeddingDataset(wds.DataPipeline, wds.compat.FluidInterface):
        :param urls: A url pointing to the tar files of the webdataset formatted as /path/to/webdataset/{0000..9999}.tar
        :param embedding_folder_url: Required if webdataset does not contain embeddings. A url pointing to the npy files of the embeddings. Should have the same number of shards as the webdataset.
            Webdataset image keys should align with the index of the embedding. This means missing image indices must have a corresponding embedding of all zeros.
-        :param shard_width: The number of digits in the shard number. This is used to align the embedding index with the image index.
-            For example, if a file in the webdataset shard 3 is named 0003039.jpg, we know the shard with this 4 and the last three digits are the index.
+        :param index_width: The number of digits in the index. This is used to align the embedding index with the image index.
+            For example, if a file in the webdataset shard 3 is named 0003039.jpg, the shard takes 4 digits and the index takes the last 3, so index_width is 3.
+        :param img_preproc: This function is run on the img before it is batched and returned. Useful for data augmentation or converting to torch tensor.
        :param handler: A webdataset handler.
        :param resample: If true, resample webdataset shards with replacement. You need to set your own epoch size if this is true since it will resample infinitely.
        :param shuffle_shards: If true, shuffle the shards before resampling. This cannot be true if resample is true.
+
+
        """
        super().__init__()
+        keys = ["jpg", "npy"] + extra_keys
+        self.key_map = {key: i for i, key in enumerate(keys)}
+        self.resampling = resample
+        self.img_preproc = img_preproc
+        # If s3, check if s3fs is installed and s3cmd is installed and check if the data is piped instead of straight up
+        if (isinstance(urls, str) and "s3:" in urls) or (isinstance(urls, list) and any(["s3:" in url for url in urls])):
+            # Then this has an s3 link for the webdataset and we need extra packages
+            if shutil.which("s3cmd") is None:
+                raise RuntimeError("s3cmd is required for s3 webdataset")
+        if embedding_folder_url is not None and "s3:" in embedding_folder_url:
+            # Then the embeddings are being loaded from s3 and fsspec requires s3fs
+            try:
+                import s3fs
+            except ImportError:
+                raise RuntimeError("s3fs is required to load embeddings from s3")
        # Add the shardList and randomize or resample if requested
        if resample:
            assert not shuffle_shards, "Cannot both resample and shuffle"
@@ -112,28 +159,43 @@ class ImageEmbeddingDataset(wds.DataPipeline, wds.compat.FluidInterface):
            self.append(wds.SimpleShardList(urls))
            if shuffle_shards:
                self.append(wds.filters.shuffle(1000))
+        
+        if embedding_folder_url is not None:
+            # There may be webdataset shards that do not have an embedding shard associated with them. If we do not skip these, they would cause issues.
+            self.append(skip_unassociated_shards(embeddings_url=embedding_folder_url, handler=handler))

        self.append(wds.split_by_node)
        self.append(wds.split_by_worker)

        self.append(wds.tarfile_to_samples(handler=handler))
-        self.append(wds.decode("torchrgb"))
+        self.append(wds.decode("pilrgb", handler=handler))
        if embedding_folder_url is not None:
-            assert shard_width is not None, "Reading embeddings separately requires shard length to be given"
-            self.append(insert_embedding(embeddings_url=embedding_folder_url, shard_width=shard_width, handler=handler))
+            # Then we are loading embeddings from a separate source
+            assert index_width is not None, "Reading embeddings separately requires index_width to be given"
+            self.append(insert_embedding(embeddings_url=embedding_folder_url, index_width=index_width, handler=handler))
        self.append(verify_keys)
-        self.append(wds.to_tuple("jpg", "npy"))
+        # Apply preprocessing
+        self.append(wds.map(self.preproc))
+        self.append(wds.to_tuple(*keys))
+
+    def preproc(self, sample):
+        """Applies the preprocessing for images"""
+        if self.img_preproc is not None:
+            sample["jpg"] = self.img_preproc(sample["jpg"])
+        return sample

 def create_image_embedding_dataloader(
    tar_url,
    num_workers,
    batch_size,
    embeddings_url=None,
-    shard_width=None,
+    index_width=None,
    shuffle_num = None,
    shuffle_shards = True,
    resample_shards = False, 
-    handler=wds.handlers.warn_and_continue
+    img_preproc=None,
+    extra_keys=[],
+    handler=wds.handlers.reraise_exception#warn_and_continue
 ):
    """
    Convenience function to create an image embedding dataset and dataloader in one line
@@ -143,8 +205,8 @@ def create_image_embedding_dataloader(
    :param batch_size: The batch size to use for the dataloader
    :param embeddings_url: Required if webdataset does not contain embeddings. A url pointing to the npy files of the embeddings. Should have the same number of shards as the webdataset.
        Webdataset image keys should align with the index of the embedding. This means missing image indices must have a corresponding embedding of all zeros.
-    :param shard_width: The number of digits in the shard number. This is used to align the embedding index with the image index.
-        For example, if a file in the webdataset shard 3 is named 0003039.jpg, we know the shard width is 4 and the last three digits are the index.
+    :param index_width: The number of digits in the index. This is used to align the embedding index with the image index.
+            For example, if a file in webdataset shard 3 is named 0003039.jpg, the first 4 digits are the shard number and the last 3 digits are the index, so index_width is 3.
    :param shuffle_num: If not None, shuffle the dataset with this size buffer after sampling.
    :param shuffle_shards: If true, shuffle the shards before sampling. This cannot be true if resample is true.
    :param resample_shards: If true, resample webdataset shards with replacement. You need to set your own epoch size if this is true since it will resample infinitely.
@@ -153,9 +215,11 @@ def create_image_embedding_dataloader(
    ds = ImageEmbeddingDataset(
        tar_url,
        embeddings_url,
-        shard_width=shard_width,
+        index_width=index_width,
        shuffle_shards=shuffle_shards,
        resample=resample_shards,
+        extra_keys=extra_keys,
+        img_preproc=img_preproc,
        handler=handler
    )
    if shuffle_num is not None and shuffle_num > 0:
--- a/dalle2_pytorch/dataloaders/embedding_wrapper.py
+++ b/dalle2_pytorch/dataloaders/embedding_wrapper.py
@@ -0,0 +1,180 @@
+from torch.utils.data import IterableDataset
+from torch import from_numpy
+from clip import tokenize
+from embedding_reader import EmbeddingReader
+
+
+class PriorEmbeddingLoader(IterableDataset):
+    def __init__(
+        self,
+        text_conditioned: bool,
+        batch_size: int,
+        start: int,
+        stop: int,
+        image_reader,
+        text_reader: EmbeddingReader = None,
+        device: str = "cpu",
+    ) -> None:
+        super().__init__()
+
+        self.text_conditioned = text_conditioned
+
+        if not self.text_conditioned:
+            self.text_reader = text_reader
+
+        self.image_reader = image_reader
+        self.batch_size = batch_size
+        self.start = start
+        self.stop = stop
+        self.device = device
+
+    def __iter__(self):
+        self.n = 0
+        loader_args = dict(
+            batch_size=self.batch_size,
+            start=self.start,
+            end=self.stop,
+            show_progress=False,
+        )
+        if self.text_conditioned:
+            self.loader = self.image_reader(**loader_args)
+        else:
+            self.loader = zip(
+                self.image_reader(**loader_args), self.text_reader(**loader_args)
+            )
+        return self
+
+    def __next__(self):
+        # StopIteration raised by the underlying reader propagates and ends iteration
+        return self.get_sample()
+
+    def get_sample(self):
+        """
+        pre-process data from either reader into a common format
+        """
+        self.n += 1
+
+        if self.text_conditioned:
+            image_embedding, caption = next(self.loader)
+
+            image_embedding = from_numpy(image_embedding).to(self.device)
+            tokenized_caption = tokenize(
+                caption["caption"].to_list(), truncate=True
+            ).to(self.device)
+
+            return image_embedding, tokenized_caption
+
+        else:
+            (image_embedding, _), (text_embedding, _) = next(self.loader)
+
+            image_embedding = from_numpy(image_embedding).to(self.device)
+            text_embedding = from_numpy(text_embedding).to(self.device)
+
+            return image_embedding, text_embedding
+
+
+def make_splits(
+    text_conditioned: bool,
+    batch_size: int,
+    num_data_points: int,
+    train_split: float,
+    eval_split: float,
+    device: str,
+    img_url: str,
+    meta_url: str = None,
+    txt_url: str = None,
+):
+
+    assert img_url is not None, "Must supply some image embeddings"
+
+    if text_conditioned:
+        assert meta_url is not None, "Must supply metadata url if text-conditioning"
+        image_reader = EmbeddingReader(
+            embeddings_folder=img_url,
+            file_format="parquet_npy",
+            meta_columns=["caption"],
+            metadata_folder=meta_url,
+        )
+
+        # compute split points
+        if num_data_points > image_reader.count:
+            print("Specified point count is larger than the number of points available...defaulting to max length of reader.")
+            num_data_points = image_reader.count
+
+        train_set_size = int(train_split * num_data_points)
+        eval_set_size = int(eval_split * num_data_points)
+        eval_stop = int(train_set_size + eval_set_size)
+
+        train_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            batch_size=batch_size,
+            start=0,
+            stop=train_set_size,
+            device=device,
+        )
+        eval_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            batch_size=batch_size,
+            start=train_set_size,
+            stop=eval_stop,
+            device=device,
+        )
+        test_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            batch_size=batch_size,
+            start=eval_stop,
+            stop=int(num_data_points),
+            device=device,
+        )
+
+    else:
+        assert (
+            txt_url is not None
+        ), "Must supply text embedding url if not text-conditioning"
+
+        image_reader = EmbeddingReader(img_url, file_format="npy")
+        text_reader = EmbeddingReader(txt_url, file_format="npy")
+
+        # compute split points
+        if num_data_points > image_reader.count:
+            print("Specified point count is larger than the number of points available...defaulting to max length of reader.")
+            num_data_points = image_reader.count
+
+        train_set_size = int(train_split * num_data_points)
+        eval_set_size = int(eval_split * num_data_points)
+        eval_stop = int(train_set_size + eval_set_size)
+
+        train_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            text_reader=text_reader,
+            batch_size=batch_size,
+            start=0,
+            stop=train_set_size,
+            device=device,
+        )
+        eval_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            text_reader=text_reader,
+            batch_size=batch_size,
+            start=train_set_size,
+            stop=eval_stop,
+            device=device,
+        )
+        test_loader = PriorEmbeddingLoader(
+            text_conditioned=text_conditioned,
+            image_reader=image_reader,
+            text_reader=text_reader,
+            batch_size=batch_size,
+            start=eval_stop,
+            stop=int(num_data_points),
+            device=device,
+        )
+
+    return train_loader, eval_loader, test_loader
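
The split arithmetic in `make_splits` can be looked at in isolation. The sketch below is not part of this diff: the helper name `compute_split_ranges` and its `available` parameter are hypothetical, and it assumes (as the code above does) that the requested point count is clamped to what the reader holds and that the test split takes whatever remains after train and eval.

```python
def compute_split_ranges(num_data_points, train_split, eval_split, available):
    """Return (start, stop) index ranges for the train, eval and test splits."""
    # Clamp to the number of embeddings the reader actually has (hypothetical `available`).
    num_data_points = min(int(num_data_points), available)

    # int() truncates, matching the diff; any rounding slack ends up in the test split.
    train_set_size = int(train_split * num_data_points)
    eval_set_size = int(eval_split * num_data_points)
    eval_stop = train_set_size + eval_set_size

    return (
        (0, train_set_size),
        (train_set_size, eval_stop),
        (eval_stop, num_data_points),
    )


# 1000 points requested, but only 800 are available
train_range, eval_range, test_range = compute_split_ranges(1000, 0.5, 0.25, available=800)
```

Because the three ranges are contiguous and non-overlapping, each `PriorEmbeddingLoader` can read its own slice from the same reader without any shuffling step.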
--- a/dalle2_pytorch/dataloaders/simple_image_only_dataloader.py
+++ b/dalle2_pytorch/dataloaders/simple_image_only_dataloader.py
@@ -0,0 +1,59 @@
+from pathlib import Path
+
+import torch
+from torch.utils import data
+from torchvision import transforms, utils
+
+from PIL import Image
+
+# helper functions
+
+def cycle(dl):
+    while True:
+        for data in dl:
+            yield data
+
+# dataset and dataloader
+
+class Dataset(data.Dataset):
+    def __init__(
+        self,
+        folder,
+        image_size,
+        exts = ['jpg', 'jpeg', 'png']
+    ):
+        super().__init__()
+        self.folder = folder
+        self.image_size = image_size
+        self.paths = [p for ext in exts for p in Path(f'{folder}').glob(f'**/*.{ext}')]
+
+        self.transform = transforms.Compose([
+            transforms.Resize(image_size),
+            transforms.RandomHorizontalFlip(),
+            transforms.CenterCrop(image_size),
+            transforms.ToTensor()
+        ])
+
+    def __len__(self):
+        return len(self.paths)
+
+    def __getitem__(self, index):
+        path = self.paths[index]
+        img = Image.open(path)
+        return self.transform(img)
+
+def get_images_dataloader(
+    folder,
+    *,
+    batch_size,
+    image_size,
+    shuffle = True,
+    cycle_dl = True,
+    pin_memory = True
+):
+    ds = Dataset(folder, image_size)
+    dl = data.DataLoader(ds, batch_size = batch_size, shuffle = shuffle, pin_memory = pin_memory)
+
+    if cycle_dl:
+        dl = cycle(dl)
+    return dl
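
A note on `cycle_dl = True`: the `cycle` helper turns the dataloader into an endless generator, so the caller has to bound the number of steps itself. A minimal sketch of that behaviour, using a plain list as a stand-in for the `DataLoader` (the list and the `islice` bound are illustrative assumptions, not code from this diff):

```python
from itertools import islice


def cycle(dl):
    # Restart the iterable every time it is exhausted, so iteration never ends.
    while True:
        for data in dl:
            yield data


batches = ["batch0", "batch1", "batch2"]  # stand-in for a DataLoader with 3 batches

# Take 5 steps; the generator wraps around after the third batch.
first_five = list(islice(cycle(batches), 5))
```

The returned object is a generator, so `len()` is not available on it and a training loop typically counts steps rather than epochs.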
--- a/dalle2_pytorch/optimizer.py
+++ b/dalle2_pytorch/optimizer.py
@@ -7,7 +7,7 @@ def separate_weight_decayable_params(params):

 def get_optimizer(
    params,
-    lr = 2e-5,
+    lr = 1e-4,
    wd = 1e-2,
    betas = (0.9, 0.999),
    eps = 1e-8,
--- a/dalle2_pytorch/trackers.py
+++ b/dalle2_pytorch/trackers.py
@@ -1,17 +1,47 @@
 import os
+from pathlib import Path
+from enum import Enum
+import importlib
+from itertools import zip_longest
+
 import torch
 from torch import nn

+# constants
+
+DEFAULT_DATA_PATH = './.tracker-data'
+
 # helper functions

 def exists(val):
    return val is not None

+def import_or_print_error(pkg_name, err_str = None):
+    try:
+        return importlib.import_module(pkg_name)
+    except ModuleNotFoundError as e:
+        if exists(err_str):
+            print(err_str)
+        exit()
+
+# load state dict functions
+
+def load_wandb_state_dict(run_path, file_path, **kwargs):
+    wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb recall function')
+    file_reference = wandb.restore(file_path, run_path=run_path)
+    return torch.load(file_reference.name)
+
+def load_local_state_dict(file_path, **kwargs):
+    return torch.load(file_path)
+
 # base class

 class BaseTracker(nn.Module):
-    def __init__(self):
+    def __init__(self, data_path = DEFAULT_DATA_PATH):
        super().__init__()
+        assert data_path is not None, "Tracker must have a data_path to save local content"
+        self.data_path = Path(data_path)
+        self.data_path.mkdir(parents = True, exist_ok = True)

    def init(self, config, **kwargs):
        raise NotImplementedError
@@ -19,6 +49,27 @@ class BaseTracker(nn.Module):
    def log(self, log, **kwargs):
        raise NotImplementedError

+    def log_images(self, images, **kwargs):
+        raise NotImplementedError
+
+    def save_state_dict(self, state_dict, relative_path, **kwargs):
+        raise NotImplementedError
+
+    def recall_state_dict(self, recall_source, *args, **kwargs):
+        """
+        Loads a state dict from any source.
+        Since a user may wish to load a model from a different source than their own tracker (e.g. tracking using wandb but recalling from disk),
+            this should not be linked to any individual tracker.
+        """
+        # TODO: Pull this into a dict or something similar so that we can add more sources without having a massive switch statement
+        if recall_source == 'wandb':
+            return load_wandb_state_dict(*args, **kwargs)
+        elif recall_source == 'local':
+            return load_local_state_dict(*args, **kwargs)
+        else:
+            raise ValueError('`recall_source` must be one of `wandb` or `local`')
+
+
 # basic stdout class

 class ConsoleTracker(BaseTracker):
@@ -28,22 +79,39 @@ class ConsoleTracker(BaseTracker):
    def log(self, log, **kwargs):
        print(log)

+    def log_images(self, images, **kwargs): # noop for logging images
+        pass
+    
+    def save_state_dict(self, state_dict, relative_path, **kwargs):
+        torch.save(state_dict, str(self.data_path / relative_path))
+
 # basic wandb class

 class WandbTracker(BaseTracker):
-    def __init__(self):
-        super().__init__()
-        try:
-            import wandb
-        except ImportError as e:
-            print('`pip install wandb` to use the wandb experiment tracker')
-            raise e
-
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb experiment tracker')
        os.environ["WANDB_SILENT"] = "true"
-        self.wandb = wandb

    def init(self, **config):
        self.wandb.init(**config)

-    def log(self, log, **kwargs):
+    def log(self, log, verbose=False, **kwargs):
+        if verbose:
+            print(log)
        self.wandb.log(log, **kwargs)
+
+    def log_images(self, images, captions=[], image_section="images", **kwargs):
+        """
+        Takes a tensor of images and a list of captions and logs them to wandb.
+        """
+        wandb_images = [self.wandb.Image(image, caption=caption) for image, caption in zip_longest(images, captions)]
+        self.wandb.log({ image_section: wandb_images }, **kwargs)
+    
+    def save_state_dict(self, state_dict, relative_path, **kwargs):
+        """
+        Saves a state_dict to disk under data_path and uploads it to wandb
+        """
+        full_path = str(self.data_path / relative_path)
+        torch.save(state_dict, full_path)
+        self.wandb.save(full_path, base_path = str(self.data_path))  # Upload and keep relative to data_path
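
The TODO in `recall_state_dict` suggests replacing the `if`/`elif` chain with a lookup table. One possible shape is sketched below. The loader bodies are placeholders that return plain dicts so the example runs without `torch` or `wandb`, and the names `RECALL_SOURCES`, `load_from_wandb` and `load_from_local` are hypothetical; in the tracker they would be the `load_wandb_state_dict` and `load_local_state_dict` functions added above.

```python
def load_from_wandb(run_path, file_path, **kwargs):
    # Placeholder: a real loader would restore the file from wandb and torch.load it.
    return {"source": "wandb", "run_path": run_path, "file_path": file_path}


def load_from_local(file_path, **kwargs):
    # Placeholder: a real loader would torch.load the file from disk.
    return {"source": "local", "file_path": file_path}


# Adding a new source means adding one entry here.
RECALL_SOURCES = {
    "wandb": load_from_wandb,
    "local": load_from_local,
}


def recall_state_dict(recall_source, *args, **kwargs):
    try:
        loader = RECALL_SOURCES[recall_source]
    except KeyError:
        valid = ", ".join(sorted(RECALL_SOURCES))
        raise ValueError(f"`recall_source` must be one of: {valid}") from None
    return loader(*args, **kwargs)


result = recall_state_dict("local", "checkpoints/latest.pth")
```

The error message is built from the table's keys, so it stays accurate as sources are added.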
--- a/dalle2_pytorch/trainer.py
+++ b/dalle2_pytorch/trainer.py
@@ -47,6 +47,14 @@ def groupby_prefix_and_trim(prefix, d):
    kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
    return kwargs_without_prefix, kwargs

+def num_to_groups(num, divisor):
+    groups = num // divisor
+    remainder = num % divisor
+    arr = [divisor] * groups
+    if remainder > 0:
+        arr.append(remainder)
+    return arr
+
 # decorators

 def cast_torch_tensor(fn):
@@ -179,8 +187,8 @@ class EMA(nn.Module):
        self.online_model = model
        self.ema_model = copy.deepcopy(model)

-        self.update_after_step = update_after_step # only start EMA after this step number, starting at 0
        self.update_every = update_every
+        self.update_after_step = update_after_step  // update_every # only start EMA after this step number, starting at 0

        self.register_buffer('initted', torch.Tensor([False]))
        self.register_buffer('step', torch.tensor([0.]))
@@ -189,14 +197,21 @@ class EMA(nn.Module):
        device = self.initted.device
        self.ema_model.to(device)

+    def copy_params_from_model_to_ema(self):
+        self.ema_model.load_state_dict(self.online_model.state_dict())
+
    def update(self):
        self.step += 1

-        if self.step <= self.update_after_step or (self.step % self.update_every) != 0:
+        if (self.step % self.update_every) != 0:
+            return
+
+        if self.step <= self.update_after_step:
+            self.copy_params_from_model_to_ema()
            return

        if not self.initted:
-            self.ema_model.state_dict(self.online_model.state_dict())
+            self.copy_params_from_model_to_ema()
            self.initted.data.copy_(torch.Tensor([True]))

        self.update_moving_average(self.ema_model, self.online_model)
@@ -220,6 +235,16 @@ class EMA(nn.Module):

 # diffusion prior trainer

+def prior_sample_in_chunks(fn):
+    @wraps(fn)
+    def inner(self, *args, max_batch_size = None, **kwargs):
+        if not exists(max_batch_size):
+            return fn(self, *args, **kwargs)
+
+        outputs = [fn(self, *chunked_args, **chunked_kwargs) for _, (chunked_args, chunked_kwargs) in split_args_and_kwargs(*args, split_size = max_batch_size, **kwargs)]
+        return torch.cat(outputs, dim = 0)
+    return inner
+
 class DiffusionPriorTrainer(nn.Module):
    def __init__(
        self,
@@ -278,17 +303,19 @@ class DiffusionPriorTrainer(nn.Module):

        self.step += 1

-    @torch.inference_mode()
+    @torch.no_grad()
    @cast_torch_tensor
+    @prior_sample_in_chunks
    def p_sample_loop(self, *args, **kwargs):
        return self.ema_diffusion_prior.ema_model.p_sample_loop(*args, **kwargs)

-    @torch.inference_mode()
+    @torch.no_grad()
    @cast_torch_tensor
+    @prior_sample_in_chunks
    def sample(self, *args, **kwargs):
        return self.ema_diffusion_prior.ema_model.sample(*args, **kwargs)

-    @torch.inference_mode()
+    @torch.no_grad()
    def sample_batch_size(self, *args, **kwargs):
        return self.ema_diffusion_prior.ema_model.sample_batch_size(*args, **kwargs)

@@ -315,15 +342,31 @@ class DiffusionPriorTrainer(nn.Module):

 # decoder trainer

+def decoder_sample_in_chunks(fn):
+    @wraps(fn)
+    def inner(self, *args, max_batch_size = None, **kwargs):
+        if not exists(max_batch_size):
+            return fn(self, *args, **kwargs)
+
+        if self.decoder.unconditional:
+            batch_size = kwargs.get('batch_size')
+            batch_sizes = num_to_groups(batch_size, max_batch_size)
+            outputs = [fn(self, *args, **{**kwargs, 'batch_size': sub_batch_size}) for sub_batch_size in batch_sizes]
+        else:
+            outputs = [fn(self, *chunked_args, **chunked_kwargs) for _, (chunked_args, chunked_kwargs) in split_args_and_kwargs(*args, split_size = max_batch_size, **kwargs)]
+
+        return torch.cat(outputs, dim = 0)
+    return inner
+
 class DecoderTrainer(nn.Module):
    def __init__(
        self,
        decoder,
        use_ema = True,
-        lr = 2e-5,
+        lr = 1e-4,
        wd = 1e-2,
        eps = 1e-8,
-        max_grad_norm = None,
+        max_grad_norm = 0.5,
        amp = False,
        **kwargs
    ):
@@ -377,8 +420,11 @@ class DecoderTrainer(nn.Module):
        scaler = getattr(self, f'scaler{index}')
        return scaler.scale(loss)

-    def update(self, unet_number):
-        assert 1 <= unet_number <= self.num_unets
+    def update(self, unet_number = None):
+        if self.num_unets == 1:
+            unet_number = default(unet_number, 1)
+
+        assert exists(unet_number) and 1 <= unet_number <= self.num_unets
        index = unet_number - 1
        unet = self.decoder.unets[index]

@@ -401,15 +447,17 @@ class DecoderTrainer(nn.Module):

    @torch.no_grad()
    @cast_torch_tensor
+    @decoder_sample_in_chunks
    def sample(self, *args, **kwargs):
-        if self.use_ema:
-            trainable_unets = self.decoder.unets
-            self.decoder.unets = self.unets                  # swap in exponential moving averaged unets for sampling
+        if kwargs.pop('use_non_ema', False) or not self.use_ema:
+            return self.decoder.sample(*args, **kwargs)
+
+        trainable_unets = self.decoder.unets
+        self.decoder.unets = self.unets                  # swap in exponential moving averaged unets for sampling

        output = self.decoder.sample(*args, **kwargs)

-        if self.use_ema:
-            self.decoder.unets = trainable_unets             # restore original training unets
+        self.decoder.unets = trainable_unets             # restore original training unets

        # cast the ema_model unets back to original device
        for ema in self.ema_unets:
@@ -421,10 +469,13 @@ class DecoderTrainer(nn.Module):
    def forward(
        self,
        *args,
-        unet_number,
+        unet_number = None,
        max_batch_size = None,
        **kwargs
    ):
+        if self.num_unets == 1:
+            unet_number = default(unet_number, 1)
+
        total_loss = 0.

        for chunk_size_frac, (chunked_args, chunked_kwargs) in split_args_and_kwargs(*args, split_size = max_batch_size, **kwargs):
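
For unconditional sampling, `decoder_sample_in_chunks` relies on the `num_to_groups` helper added earlier in this file to split a requested `batch_size` into sub-batches no larger than `max_batch_size`. The function below is copied from the diff; the two calls are illustrative values chosen for this note.

```python
def num_to_groups(num, divisor):
    groups = num // divisor      # number of full-size groups
    remainder = num % divisor    # leftover that does not fill a whole group
    arr = [divisor] * groups
    if remainder > 0:
        arr.append(remainder)
    return arr


# 10 samples with at most 4 per forward pass: two full chunks and one partial chunk
sub_batch_sizes = num_to_groups(10, 4)

# an exact multiple produces no trailing partial chunk
exact = num_to_groups(8, 4)
```

Note that the decorator reads the size with `kwargs.get('batch_size')`, so in the unconditional case `batch_size` has to be passed as a keyword argument for the chunking to work.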
--- a/dalle2_pytorch/vqgan_vae_trainer.py
+++ b/dalle2_pytorch/vqgan_vae_trainer.py
--- a/samples/oxford.png
+++ b/samples/oxford.png
--- a/setup.py
+++ b/setup.py
@@ -10,7 +10,7 @@ setup(
      'dream = dalle2_pytorch.cli:dream'
    ],
  },
-  version = '0.2.37',
+  version = '0.3.4',
  license='MIT',
  description = 'DALL-E 2',
  author = 'Phil Wang',
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -5,10 +5,13 @@ import time
 import numpy as np

 import torch
+import clip
 from torch import nn

-from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork
-from dalle2_pytorch.train import DiffusionPriorTrainer, load_diffusion_model, save_diffusion_model, print_ribbon
+from dalle2_pytorch.dataloaders import make_splits
+from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork, OpenAIClipAdapter
+from dalle2_pytorch.trainer import DiffusionPriorTrainer, load_diffusion_model, save_diffusion_model, print_ribbon
+
 from dalle2_pytorch.trackers import ConsoleTracker, WandbTracker

 from embedding_reader import EmbeddingReader
@@ -17,8 +20,7 @@ from tqdm import tqdm

 # constants

-NUM_TEST_EMBEDDINGS = 100 # for cosine similarity reporting during training
-REPORT_METRICS_EVERY = 100 # for cosine similarity and other metric reporting during training
+REPORT_METRICS_EVERY = 250 # for cosine similarity and other metric reporting during training

 tracker = WandbTracker()

@@ -36,81 +38,106 @@ class Timer:

    def elapsed(self):
        return time.time() - self.last_time
+
 # functions

-def eval_model(model,device,image_reader,text_reader,start,end,batch_size,loss_type,phase="Validation"):
+def eval_model(model, dataloader, text_conditioned, loss_type, phase="Validation"):
    model.eval()
+
    with torch.no_grad():
        total_loss = 0.
        total_samples = 0.

-        for emb_images, emb_text in zip(image_reader(batch_size=batch_size, start=start, end=end),
-                text_reader(batch_size=batch_size, start=start, end=end)):
+        for image_embeddings, text_data in tqdm(dataloader):

-            emb_images_tensor = torch.tensor(emb_images[0]).to(device)
-            emb_text_tensor = torch.tensor(emb_text[0]).to(device)
+            batches = image_embeddings.shape[0]

-            batches = emb_images_tensor.shape[0]
+            input_args = dict(image_embed=image_embeddings)
+            if text_conditioned:
+                input_args = dict(**input_args, text = text_data)
+            else:
+                input_args = dict(**input_args, text_embed=text_data)

-            loss = model(text_embed = emb_text_tensor, image_embed = emb_images_tensor)
+            loss = model(**input_args)

-            total_loss += loss.item() * batches
+            total_loss += loss * batches
            total_samples += batches

        avg_loss = (total_loss / total_samples)
+
        tracker.log({f'{phase} {loss_type}': avg_loss})

-def report_cosine_sims(diffusion_prior,image_reader,text_reader,train_set_size,NUM_TEST_EMBEDDINGS,device):
+def report_cosine_sims(diffusion_prior, dataloader, text_conditioned):
    diffusion_prior.eval()

    cos = nn.CosineSimilarity(dim=1, eps=1e-6)

-    tstart = train_set_size
-    tend = train_set_size+NUM_TEST_EMBEDDINGS
+    for test_image_embeddings, text_data in tqdm(dataloader):
+
+        # if text conditioned, produce an embedding from the tokenized text
+        if text_conditioned:
+            text_embedding, text_encodings, text_mask = diffusion_prior.clip.embed_text(
+                text_data)
+            text_cond = dict(text_embed=text_embedding,
+                             text_encodings=text_encodings, mask=text_mask)
+        else:
+            text_embedding = text_data
+            text_cond = dict(text_embed=text_embedding)
+
+        # make a copy of the text embeddings for shuffling
+        text_embed_shuffled = text_embedding.clone()
+
+        # roll the text to simulate "unrelated" captions
+        rolled_idx = torch.roll(torch.arange(text_embedding.shape[0]), 1)
+        text_embed_shuffled = text_embed_shuffled[rolled_idx]
+        text_embed_shuffled = text_embed_shuffled / \
+            text_embed_shuffled.norm(dim=1, keepdim=True)
+
+        if text_conditioned:
+            text_encodings_shuffled = text_encodings[rolled_idx]
+            text_mask_shuffled = text_mask[rolled_idx]
+        else:
+            text_encodings_shuffled = None
+            text_mask_shuffled = None
+
+        text_cond_shuffled = dict(text_embed=text_embed_shuffled,
+                                  text_encodings=text_encodings_shuffled, mask=text_mask_shuffled)

-    for embt, embi in zip(text_reader(batch_size=NUM_TEST_EMBEDDINGS, start=tstart, end=tend), 
-            image_reader(batch_size=NUM_TEST_EMBEDDINGS, start=tstart, end=tend)):
-       # make a copy of the text embeddings for shuffling
-       text_embed = torch.tensor(embt[0]).to(device)
-       text_embed_shuffled = text_embed.clone()
-        # roll the text embeddings to simulate "unrelated" captions
-       rolled_idx = torch.roll(torch.arange(NUM_TEST_EMBEDDINGS), 1)
-       text_embed_shuffled = text_embed_shuffled[rolled_idx]
-       text_embed_shuffled = text_embed_shuffled / \
-           text_embed_shuffled.norm(dim=1, keepdim=True)
-       test_text_shuffled_cond = dict(text_embed=text_embed_shuffled)
        # prepare the text embedding
-       text_embed = text_embed / text_embed.norm(dim=1, keepdim=True)
-       test_text_cond = dict(text_embed=text_embed)
+        text_embed = text_embedding / text_embedding.norm(dim=1, keepdim=True)
+
        # prepare image embeddings
-       test_image_embeddings = torch.tensor(embi[0]).to(device)
-       test_image_embeddings = test_image_embeddings / \
-           test_image_embeddings.norm(dim=1, keepdim=True)
+        test_image_embeddings = test_image_embeddings / \
+            test_image_embeddings.norm(dim=1, keepdim=True)
+
        # predict on the unshuffled text embeddings
-       predicted_image_embeddings = diffusion_prior.p_sample_loop(
-           (NUM_TEST_EMBEDDINGS, 768), text_cond=test_text_cond)
-       predicted_image_embeddings = predicted_image_embeddings / \
-           predicted_image_embeddings.norm(dim=1, keepdim=True)
+        predicted_image_embeddings = diffusion_prior.p_sample_loop(
+            test_image_embeddings.shape, text_cond)
+        predicted_image_embeddings = predicted_image_embeddings / \
+            predicted_image_embeddings.norm(dim=1, keepdim=True)
+
        # predict on the shuffled embeddings
-       predicted_unrelated_embeddings = diffusion_prior.p_sample_loop(
-           (NUM_TEST_EMBEDDINGS, 768), text_cond=test_text_shuffled_cond)
-       predicted_unrelated_embeddings = predicted_unrelated_embeddings / \
-           predicted_unrelated_embeddings.norm(dim=1, keepdim=True)
+        predicted_unrelated_embeddings = diffusion_prior.p_sample_loop(
+            test_image_embeddings.shape, text_cond_shuffled)
+        predicted_unrelated_embeddings = predicted_unrelated_embeddings / \
+            predicted_unrelated_embeddings.norm(dim=1, keepdim=True)
+
        # calculate similarities
-       original_similarity = cos(
+        original_similarity = cos(
           text_embed, test_image_embeddings).cpu().numpy()
-       predicted_similarity = cos(
+        predicted_similarity = cos(
           text_embed, predicted_image_embeddings).cpu().numpy()
-       unrelated_similarity = cos(
+        unrelated_similarity = cos(
           text_embed, predicted_unrelated_embeddings).cpu().numpy()
-       predicted_img_similarity = cos(
+        predicted_img_similarity = cos(
           test_image_embeddings, predicted_image_embeddings).cpu().numpy()
-       tracker.log({"CosineSimilarity(text_embed,image_embed)": np.mean(original_similarity),
+        tracker.log({"CosineSimilarity(text_embed,image_embed)": np.mean(original_similarity),
            "CosineSimilarity(text_embed,predicted_image_embed)":np.mean(predicted_similarity),
            "CosineSimilarity(orig_image_embed,predicted_image_embed)":np.mean(predicted_img_similarity),
            "CosineSimilarity(text_embed,predicted_unrelated_embed)": np.mean(unrelated_similarity),
            "Cosine similarity difference":np.mean(predicted_similarity - original_similarity)})

+
@click.command()
@click.option("--wandb-entity", default="laion")
@click.option("--wandb-project", default="diffusion-prior")
@@ -118,29 +145,32 @@ def report_cosine_sims(diffusion_prior,image_reader,text_reader,train_set_size,N
@click.option("--wandb-arch", default="DiffusionPrior")
@click.option("--image-embed-url", default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/")
@click.option("--text-embed-url", default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/")
+@click.option("--meta-url", default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/laion2B-en-metadata/")
@click.option("--learning-rate", default=1.1e-4)
@click.option("--weight-decay", default=6.02e-2)
@click.option("--dropout", default=5e-2)
@click.option("--max-grad-norm", default=0.5)
-@click.option("--batch-size", default=10**4)
+@click.option("--num-data-points", default=250e6)
+@click.option("--batch-size", default=320)
@click.option("--num-epochs", default=5)
@click.option("--image-embed-dim", default=768)
-@click.option("--train-percent", default=0.7)
-@click.option("--val-percent", default=0.2)
-@click.option("--test-percent", default=0.1)
-@click.option("--dpn-depth", default=6)
+@click.option("--train-percent", default=0.9)
+@click.option("--val-percent", default=1e-7)
+@click.option("--test-percent", default=0.0999999)
+@click.option("--dpn-depth", default=12)
@click.option("--dpn-dim-head", default=64)
-@click.option("--dpn-heads", default=8)
-@click.option("--dp-condition-on-text-encodings", default=False)
-@click.option("--dp-timesteps", default=100)
-@click.option("--dp-normformer", default=False)
+@click.option("--dpn-heads", default=12)
+@click.option("--dp-condition-on-text-encodings", default=True)
+@click.option("--dp-timesteps", default=1000)
+@click.option("--dp-normformer", default=True)
@click.option("--dp-cond-drop-prob", default=0.1)
@click.option("--dp-loss-type", default="l2")
-@click.option("--clip", default=None)
+@click.option("--clip", default="ViT-L/14")
@click.option("--amp", default=False)
-@click.option("--save-interval", default=30)
+@click.option("--save-interval", default=120)
@click.option("--save-path", default="./diffusion_prior_checkpoints")
@click.option("--pretrained-model-path", default=None)
+@click.option("--gpu-device", default=0)
 def train(
    wandb_entity,
    wandb_project,
@@ -148,10 +178,12 @@ def train(
    wandb_arch,
    image_embed_url,
    text_embed_url,
+    meta_url,
    learning_rate,
    weight_decay,
    dropout,
    max_grad_norm,
+    num_data_points,
    batch_size,
    num_epochs,
    image_embed_dim,
@@ -170,7 +202,8 @@ def train(
    amp,
    save_interval,
    save_path,
-    pretrained_model_path
+    pretrained_model_path,
+    gpu_device
 ):
    config = {
        "learning_rate": learning_rate,
@@ -197,7 +230,7 @@ def train(

    # Check if DPRIOR_PATH exists(saved model path)

-    DPRIOR_PATH = args.pretrained_model_path
+    DPRIOR_PATH = pretrained_model_path
    RESUME = exists(DPRIOR_PATH)

    if not RESUME:
@@ -211,7 +244,7 @@ def train(

    has_cuda = torch.cuda.is_available()
    if has_cuda:
-        device = torch.device("cuda:0")
+        device = torch.device(f"cuda:{gpu_device}")
        torch.cuda.set_device(device)

    # Training loop
@@ -227,11 +260,17 @@ def train(
        normformer = dp_normformer
    )
    
+    # Load clip model if text-conditioning
+    if dp_condition_on_text_encodings:
+        clip_adapter = OpenAIClipAdapter(clip)
+    else:
+        clip_adapter = None
+        
    # diffusion prior with text embeddings and image embeddings pre-computed

    diffusion_prior = DiffusionPrior( 
        net = prior_network,
-        clip = clip,
+        clip = clip_adapter,
        image_embed_dim = image_embed_dim,
        timesteps = dp_timesteps,
        cond_drop_prob = dp_cond_drop_prob,
@@ -265,33 +304,37 @@ def train(

    Path(save_path).mkdir(exist_ok = True, parents = True)

-    # Get image and text embeddings from the servers
+    # Utilize wrapper to abstract away loader logic
+    print_ribbon("Downloading Embeddings")
+    loader_args = dict(text_conditioned=dp_condition_on_text_encodings, batch_size=batch_size, num_data_points=num_data_points,
+                       train_split=train_percent, eval_split=val_percent, device=device, img_url=image_embed_url)

-    print_ribbon("Downloading embeddings - image and text")
-    image_reader = EmbeddingReader(embeddings_folder=image_embed_url, file_format="npy")
-    text_reader  = EmbeddingReader(embeddings_folder=text_embed_url, file_format="npy")
-    num_data_points = text_reader.count
+    if dp_condition_on_text_encodings:
+        loader_args = dict(**loader_args, meta_url=meta_url)
+    else:
+        loader_args = dict(**loader_args, txt_url=text_embed_url)
+
+    train_loader, eval_loader, test_loader = make_splits(**loader_args)

    ### Training code ###

+    step = 1 
    timer = Timer()
    epochs = num_epochs

-    train_set_size = int(train_percent*num_data_points)
-    val_set_size = int(val_percent*num_data_points)
-    eval_start = train_set_size
-
    for _ in range(epochs):

-        for emb_images,emb_text in zip(image_reader(batch_size=batch_size, start=0, end=train_set_size),
-                text_reader(batch_size=batch_size, start=0, end=train_set_size)):
-
-            trainer.train()
+        for image, text in tqdm(train_loader):
            
-            emb_images_tensor = torch.tensor(emb_images[0]).to(device)
-            emb_text_tensor = torch.tensor(emb_text[0]).to(device)
+            diffusion_prior.train()
+            
+            input_args = dict(image_embed=image)
+            if dp_condition_on_text_encodings:
+                input_args = dict(**input_args, text = text)
+            else:
+                input_args = dict(**input_args, text_embed=text)

-            loss = trainer(text_embed = emb_text_tensor, image_embed = emb_images_tensor)
+            loss = trainer(**input_args)

            # Samples per second

@@ -310,37 +353,23 @@ def train(
                    image_embed_dim)

            # Log to wandb
-            tracker.log({"Training loss": loss.item(),
+            tracker.log({"Training loss": loss,
                        "Steps": step,
                        "Samples per second": samples_per_sec})
            # Log cosineSim(text_embed,predicted_image_embed) - cosineSim(text_embed,image_embed)
            # Use NUM_TEST_EMBEDDINGS samples from the test set each time
            # Get embeddings from the most recently saved model
            if(step % REPORT_METRICS_EVERY) == 0:
-                report_cosine_sims(diffusion_prior,
-                        image_reader,
-                        text_reader,
-                        train_set_size,
-                        NUM_TEST_EMBEDDINGS,
-                        device)
+                report_cosine_sims(diffusion_prior, eval_loader, dp_condition_on_text_encodings)
                ### Evaluate model(validation run) ###
-                eval_model(diffusion_prior,
-                        device,
-                        image_reader,
-                        text_reader,
-                        eval_start,
-                        eval_start+NUM_TEST_EMBEDDINGS,
-                        NUM_TEST_EMBEDDINGS,
-                        dp_loss_type,
-                        phase="Validation")
+                eval_model(diffusion_prior, eval_loader, dp_condition_on_text_encodings, dp_loss_type, phase="Validation")

+            step += 1
            trainer.update()

    ### Test run ###
-    test_set_size = int(test_percent*train_set_size) 
-    start = train_set_size+val_set_size
-    end = num_data_points
-    eval_model(diffusion_prior,device,image_reader,text_reader,start,end,batch_size,dp_loss_type,phase="Test")
+    eval_model(diffusion_prior, test_loader, dp_condition_on_text_encodings, dp_loss_type, phase="Test")
+

 if __name__ == "__main__":
    train()
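
The hunk above drops the hand-computed `train_set_size` / `val_set_size` offsets and takes three loaders from `make_splits` instead. The internals of `make_splits` are not shown in this diff, so the following is only a rough sketch of the index arithmetic such a wrapper needs, assuming contiguous, non-overlapping ranges where the test split takes whatever remains. The helper name `split_ranges` is hypothetical and not part of the repository.

```python
def split_ranges(num_data_points, train_split, eval_split):
    """Hypothetical helper: (start, end) index pairs for train / eval / test.

    Assumes the splits are fractions of num_data_points and that the test
    split is the remainder after the train and eval ranges.
    """
    if train_split < 0 or eval_split < 0 or train_split + eval_split > 1:
        raise ValueError("splits must be non-negative and sum to at most 1")

    train_end = int(train_split * num_data_points)
    eval_end = train_end + int(eval_split * num_data_points)

    return {
        "train": (0, train_end),
        "eval": (train_end, eval_end),
        "test": (eval_end, num_data_points),
    }


ranges = split_ranges(1000, 0.5, 0.25)
print(ranges)
```

Computing every boundary in one place avoids the kind of offset bookkeeping the removed lines did by hand (`eval_start`, `start`, `end`).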
Author	SHA1	Message	Date
Phil Wang	9340d33d5f	fix wandb logging in tracker, and do some cleanup	2022-05-20 17:10:33 -07:00
Aidan Dempster	e0524a6aff	Implemented the wandb tracker (#106). Added a base_path parameter to all trackers for storing any local information they need to	2022-05-20 16:39:23 -07:00
Aidan Dempster	c85e0d5c35	Update decoder dataloader (#105): updated the decoder dataloader; removed unnecessary logging for required packages; transferred to using index width instead of shard width; added the ability to select extra keys to return from the webdataset; added README for decoder loader	2022-05-20 16:38:55 -07:00
Phil Wang	db0642c4cd	quick fix for @marunine	2022-05-18 20:22:52 -07:00
Phil Wang	bb86ab2404	update sample, and set default gradient clipping value for decoder training	2022-05-16 17:38:30 -07:00
Phil Wang	ae056dd67c	samples	2022-05-16 13:46:35 -07:00
Phil Wang	033d6b0ce8	last update	2022-05-16 13:38:33 -07:00
Phil Wang	c7ea8748db	default decoder learning rate to what was in the paper	2022-05-16 13:33:54 -07:00
Phil Wang	13382885d9	final update to dalle2 repository for a while - sampling from prior in chunks automatically with max_batch_size keyword given	2022-05-16 12:57:31 -07:00
Phil Wang	c3d4a7ffe4	update working unconditional decoder example	2022-05-16 12:50:07 -07:00
Phil Wang	164d9be444	use a decorator and take care of sampling in chunks (max_batch_size keyword), in case one is sampling a huge grid of images	2022-05-16 12:34:28 -07:00
Phil Wang	5562ec6be2	status updates	2022-05-16 12:01:54 -07:00
Phil Wang	89ff04cfe2	final tweak to EMA class	2022-05-16 11:54:34 -07:00
Phil Wang	f4016f6302	allow for overriding use of EMA during sampling in decoder trainer with use_non_ema keyword, also fix some issues with automatic normalization of images and low res conditioning image if latent diffusion is in play	2022-05-16 11:18:30 -07:00
Phil Wang	1212f7058d	allow text encodings and text mask to be passed in on forward and sampling for Decoder class	2022-05-16 10:40:32 -07:00
Phil Wang	dab106d4e5	back to no_grad for now, also keep track and restore unet devices in one_unet_in_gpu contextmanager	2022-05-16 09:36:14 -07:00
Phil Wang	bb151ca6b1	unet_number on decoder trainer only needs to be passed in if there is greater than 1 unet, so that unconditional training of a single ddpm is seamless (experiment in progress locally)	2022-05-16 09:17:17 -07:00
zion	4a59dea4cf	Migrate to text-conditioned prior training (#95): migrate to conditioned prior; unify reader logic with a wrapper (#1); separate out reader logic; support both training methods; update train prior to use embedding wrapper (#3); support both methods; bug fixes; small bug fixes; embedding only wrapper bug; use smaller val perc; final bug fix for embedding-only. Co-authored-by: nousr <>	2022-05-15 20:16:38 -07:00
Phil Wang	ecf9e8027d	make sure classifier free guidance is used only if conditional dropout is present on the DiffusionPrior and Decoder classes. also make sure prior can have a different conditional scale than decoder	2022-05-15 19:09:38 -07:00
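
Commit 164d9be444 above mentions a decorator that samples in chunks when a `max_batch_size` keyword is given. The repository's implementation is not part of this diff, so the following is only a minimal sketch of the general pattern. It uses plain Python lists rather than tensors, and the names `chunked` and `fake_sample` are hypothetical.

```python
from functools import wraps


def chunked(fn):
    """Hypothetical decorator: call fn on slices of at most max_batch_size items."""

    @wraps(fn)
    def inner(items, *args, max_batch_size=None, **kwargs):
        # Without the keyword, behave exactly like the wrapped function.
        if max_batch_size is None:
            return fn(items, *args, **kwargs)

        out = []
        for start in range(0, len(items), max_batch_size):
            chunk = items[start:start + max_batch_size]
            out.extend(fn(chunk, *args, **kwargs))
        return out

    return inner


calls = []  # records the size of each chunk the wrapped function sees


@chunked
def fake_sample(prompts):
    # Stand-in for an expensive sampling call.
    calls.append(len(prompts))
    return [p.upper() for p in prompts]


result = fake_sample(["a", "b", "c", "d", "e"], max_batch_size=2)
print(result, calls)
```

Because the keyword is consumed by the wrapper, the wrapped function never sees `max_batch_size`, and callers who omit it get the unchunked behaviour.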