Compare commits

...

33 Commits

Author SHA1 Message Date
Phil Wang
8864fd0aa7 bring in the dynamic thresholding technique from the Imagen paper, which purportedly improves classifier free guidance for the cascading ddpm 2022-05-24 18:15:14 -07:00
Phil Wang
72bf159331 update 2022-05-24 08:25:40 -07:00
Phil Wang
e5e47cfecb link to aidan's test run 2022-05-23 12:41:46 -07:00
Phil Wang
fa533962bd just use an assert to make sure clip image channels is never different than the channels of the diffusion prior and decoder, if clip is given 2022-05-22 22:43:14 -07:00
Phil Wang
276abf337b fix and cleanup image size determination logic in decoder 2022-05-22 22:28:45 -07:00
Phil Wang
ae42d03006 allow for saving of additional fields on save method in trainers, and return loaded objects from the load method 2022-05-22 22:14:25 -07:00
Phil Wang
4d346e98d9 allow for config driven creation of clip-less diffusion prior 2022-05-22 20:36:20 -07:00
Phil Wang
2b1fd1ad2e product management 2022-05-22 19:23:40 -07:00
zion
82a2ef37d9 Update README.md (#109)
block in a section that links to available pre-trained models for those who are interested
2022-05-22 19:22:30 -07:00
Phil Wang
5c397c9d66 move neural network creations off the configuration file into the pydantic classes 2022-05-22 19:18:18 -07:00
Phil Wang
0f4edff214 derived value for image preprocessing belongs to the data config class 2022-05-22 18:42:40 -07:00
Phil Wang
501a8c7c46 small cleanup 2022-05-22 15:39:38 -07:00
Phil Wang
4e49373fc5 project management 2022-05-22 15:27:40 -07:00
Phil Wang
49de72040c fix decoder trainer optimizer loading (since there are multiple for each unet), also save and load step number correctly 2022-05-22 15:21:00 -07:00
Phil Wang
271a376eaf 0.4.3 2022-05-22 15:10:28 -07:00
Phil Wang
e527002472 take care of saving and loading functions on the diffusion prior and decoder training classes 2022-05-22 15:10:15 -07:00
Phil Wang
c12e067178 let the pydantic config base model take care of loading configuration from json path 2022-05-22 14:47:23 -07:00
Phil Wang
c6629c431a make training splits into its own pydantic base model, validate it sums to 1, make decoder script cleaner 2022-05-22 14:43:22 -07:00
Phil Wang
7ac2fc79f2 add renamed train decoder json file 2022-05-22 14:32:50 -07:00
Phil Wang
a1ef023193 use pydantic to manage decoder training configs + defaults and refactor training script 2022-05-22 14:27:40 -07:00
Phil Wang
d49eca62fa dep 2022-05-21 11:27:52 -07:00
Phil Wang
8aab69b91e final thought 2022-05-21 10:47:45 -07:00
Phil Wang
b432df2f7b final cleanup to decoder script 2022-05-21 10:42:16 -07:00
Phil Wang
ebaa0d28c2 product management 2022-05-21 10:30:52 -07:00
Phil Wang
8b0d459b25 move config parsing logic to own file, consider whether to find an off-the-shelf solution at future date 2022-05-21 10:30:10 -07:00
Phil Wang
0064661729 small cleanup of decoder train script 2022-05-21 10:17:13 -07:00
Phil Wang
b895f52843 appreciation section 2022-05-21 08:32:12 -07:00
Phil Wang
80497e9839 accept unets as list for decoder 2022-05-20 20:31:26 -07:00
Phil Wang
f526f14d7c bump 2022-05-20 20:20:40 -07:00
Phil Wang
8997f178d6 small cleanup with timer 2022-05-20 20:05:01 -07:00
Aidan Dempster
022c94e443 Added single GPU training script for decoder (#108)
Added config files for training

Changed example image generation to be more efficient

Added configuration description to README

Removed unused import
2022-05-20 19:46:19 -07:00
Phil Wang
430961cb97 it was correct the first time, my bad 2022-05-20 18:05:15 -07:00
Phil Wang
721f9687c1 fix wandb logging in tracker, and do some cleanup 2022-05-20 17:27:43 -07:00
13 changed files with 1090 additions and 63 deletions

.gitignore (vendored, 9 changed lines)

@@ -1,3 +1,12 @@
# default experiment tracker data
.tracker-data/
# Configuration Files
configs/*
!configs/*.example
!configs/*_defaults.py
!configs/README.md
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

README.md

@@ -12,7 +12,7 @@ This model is SOTA for text-to-image for now.
Please join <a href="https://discord.gg/xBPBXfcFHd"><img alt="Join us on Discord" src="https://img.shields.io/discord/823813159592001537?color=5865F2&logo=discord&logoColor=white"></a> if you are interested in helping out with the replication with the <a href="https://laion.ai/">LAION</a> community | <a href="https://www.youtube.com/watch?v=AIOE1l1W0Tw">Yannic Interview</a>
There was enough interest for a <a href="https://github.com/lucidrains/dalle2-jax">Jax version</a>. I will also eventually extend this to <a href="https://github.com/lucidrains/dalle2-video">text to video</a>, once the repository is in a good place.
As of 5/23/22, it is no longer SOTA. SOTA will be <a href="https://github.com/lucidrains/imagen-pytorch">here</a>. The Jax version as well as the text-to-video project will be shifted towards the Imagen architecture, as it is way simpler.
## Status
@@ -24,6 +24,11 @@ There was enough interest for a <a href="https://github.com/lucidrains/dalle2-ja
*ongoing at 21k steps*
## Pre-Trained Models
- LAION is training prior models. Checkpoints are available on <a href="https://huggingface.co/zenglishuci/conditioned-prior">🤗huggingface</a> and the training statistics are available on <a href="https://wandb.ai/nousr_laion/conditioned-prior/reports/LAION-DALLE2-PyTorch-Prior--VmlldzoyMDI2OTIx">🐝WANDB</a>.
- Decoder - <a href="https://wandb.ai/veldrovive/dalle2_train_decoder/runs/jkrtg0so?workspace=user-veldrovive">In-progress test run</a> 🚧
- DALL-E 2 🚧
## Install
```bash
@@ -1034,6 +1039,18 @@ Once built, images will be saved to the same directory the command is invoked
<a href="https://github.com/lucidrains/stylegan2-pytorch">template</a>
## Appreciation
This library would not have gotten to this working state without the help of
- <a href="https://github.com/nousr">Zion</a> and <a href="https://github.com/krish240574">Kumar</a> for the diffusion training script
- <a href="https://github.com/Veldrovive">Aidan</a> for the decoder training script and dataloaders
- <a href="https://github.com/rom1504">Romain</a> for the pull request reviews and project management
- <a href="https://github.com/Ciaohe">He Cao</a> and <a href="https://github.com/xiankgx">xiankgx</a> for the Q&A and for identifying critical bugs
- <a href="https://github.com/crowsonkb">Katherine</a> for her advice
... and many others. Thank you! 🙏
## Todo
- [x] finish off gaussian diffusion class for latent embedding - allow for prediction of epsilon
@@ -1064,6 +1081,10 @@ Once built, images will be saved to the same directory the command is invoked
- [x] bring in cross-scale embedding from iclr paper https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/crossformer.py#L14
- [x] cross embed layers for downsampling, as an option
- [x] use an experimental tracker agnostic setup, as done <a href="https://github.com/lucidrains/tf-bind-transformer#simple-trainer-class-for-fine-tuning">here</a>
- [x] use pydantic for config driven training
- [x] for both diffusion prior and decoder, all exponential moving averaged models need to be saved and restored as well (as well as the step number)
- [x] offer save / load methods on the trainer classes to automatically take care of state dicts for scalers / optimizers / saving versions and checking for breaking changes
- [x] allow for creation of diffusion prior model off pydantic config classes - consider the same for tracker configs
- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo) - consider https://github.com/lucidrains/uformer-pytorch attention-based unet
- [ ] transcribe code to Jax, which lowers the activation energy for distributed training, given access to TPUs
- [ ] train on a toy task, offer in colab
@@ -1073,12 +1094,9 @@ Once built, images will be saved to the same directory the command is invoked
- [ ] test out grid attention in cascading ddpm locally, decide whether to keep or remove
- [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2
- [ ] make sure FILIP works with DALL-E2 from x-clip https://arxiv.org/abs/2111.07783
- [ ] offer save / load methods on the trainer classes to automatically take care of state dicts for scalers / optimizers / saving versions and checking for breaking changes
- [ ] bring in skip-layer excitations (from lightweight gan paper) to see if it helps for either decoder unet or vqgan-vae training
- [ ] decoder needs one day's worth of refactoring for tech debt
- [ ] allow for unet to be able to condition non-cross attention style as well
- [ ] for all model classes with hyperparameters that change the network architecture, make it a requirement that they expose a config property, and write a simple function that asserts that it restores the object correctly
- [ ] for both diffusion prior and decoder, all exponential moving averaged models need to be saved and restored as well (as well as the step number)
- [ ] read the paper, figure it out, and build it https://github.com/lucidrains/DALLE2-pytorch/issues/89
## Citations
@@ -1177,4 +1195,12 @@ Once built, images will be saved to the same directory the command is invoked
}
```
```bibtex
@misc{Saharia2022,
title = {Imagen: unprecedented photorealism × deep level of language understanding},
author = {Chitwan Saharia*, William Chan*, Saurabh Saxena†, Lala Li†, Jay Whang†, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho†, David Fleet†, Mohammad Norouzi*},
year = {2022}
}
```
*Creating noise from data is easy; creating data from noise is generative modeling.* - <a href="https://arxiv.org/abs/2011.13456">Yang Song's paper</a>

configs/README.md (new file, 111 lines)

@@ -0,0 +1,111 @@
## DALLE2 Training Configurations
For more complex configuration, we provide the option of using a configuration file instead of command line arguments.
### Decoder Trainer
The decoder trainer has 7 main configuration options. A full example of their use can be found in the [example decoder configuration](train_decoder_config.example.json).
**<ins>Unet</ins>:**
This is the configuration for a single unet. Unets are nested under the decoder config as a list under the `unets` key.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `dim` | Yes | N/A | The starting channels of the unet. |
| `image_embed_dim` | Yes | N/A | The dimension of the image embeddings. |
| `dim_mults` | No | `(1, 2, 4, 8)` | The growth factors of the channels. |
Any parameter from the `Unet` constructor can also be given here.
**<ins>Decoder</ins>:**
Defines the configuration options for the decoder model. The unets defined above will automatically be inserted.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `unets` | Yes | N/A | A list of unets, using the configuration above |
| `image_sizes` | Yes | N/A | The resolution of the image after each upsampling step. The length of this array should be the number of unets defined. |
| `image_size` | Yes | N/A | Not used. Can be any number. |
| `timesteps` | No | `1000` | The number of diffusion timesteps used for generation. |
| `loss_type` | No | `l2` | The loss function. Options are `l1`, `huber`, or `l2`. |
| `beta_schedule` | No | `cosine` | The noising schedule. Options are `cosine`, `linear`, `quadratic`, `jsd`, or `sigmoid`. |
| `learned_variance` | No | `True` | Whether to learn the variance. |
Any parameter from the `Decoder` constructor can also be given here.
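For orientation, here is a rough sketch of how the options above map onto the constructors (it mirrors what the pydantic `DecoderConfig.create` method does; the values are illustrative, not recommendations):
```python
from dalle2_pytorch import Unet, Decoder

# each entry of `unets` becomes a Unet; any extra keys are passed straight to the constructor
unet = Unet(dim = 128, image_embed_dim = 768, dim_mults = (1, 2, 4, 8))

# the decoder receives the unets plus the remaining decoder options
decoder = Decoder(
    [unet],                    # one unet per entry of image_sizes
    image_sizes = (64,),       # resolution after each upsampling step
    timesteps = 1000,
    loss_type = 'l2',
    beta_schedule = 'cosine',
    learned_variance = True
)
```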
**<ins>Data</ins>:**
Settings for creation of the dataloaders.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `webdataset_base_url` | Yes | N/A | The url of a shard in the webdataset with the shard number replaced by `{}`[^1]. |
| `embeddings_url` | No | N/A | The url of the folder containing embeddings shards. Not required if embeddings are in webdataset. |
| `num_workers` | No | `4` | The number of workers used in the dataloader. |
| `batch_size` | No | `64` | The batch size. |
| `start_shard` | No | `0` | Defines the start of the shard range the dataset will recall. |
| `end_shard` | No | `9999999` | Defines the end of the shard range the dataset will recall. |
| `shard_width` | No | `6` | Defines the width of one webdataset shard number[^2]. |
| `index_width` | No | `4` | Defines the width of the index of a file inside a shard[^3]. |
| `splits` | No | `{ "train": 0.75, "val": 0.15, "test": 0.1 }` | Defines the proportion of shards that will be allocated to the training, validation, and testing datasets. |
| `shuffle_train` | No | `True` | Whether to shuffle the shards of the training dataset. |
| `resample_train` | No | `False` | If true, shards will be randomly sampled with replacement from the datasets making the epoch length infinite if a limit is not set. Cannot be enabled if `shuffle_train` is enabled. |
| `preprocessing` | No | `{ "ToTensor": True }` | Defines preprocessing applied to images from the datasets. |
[^1]: If your shard files have the paths `protocol://path/to/shard/00104.tar`, then the base url would be `protocol://path/to/shard/{}.tar`. If you are using a protocol like `s3`, you need to pipe the tars. For example `pipe:s3cmd get s3://bucket/path/{}.tar -`.
[^2]: This refers to the string length of the shard number for your webdataset shards. For instance, if your webdataset shard has the filename `00104.tar`, your shard width is 5.
[^3]: Inside the webdataset `tar`, you have files named something like `001045945.jpg`. 5 of these characters refer to the shard, and 4 refer to the index of the file in the webdataset (shard is `00104` and index is `5945`). The `index_width` in this case is 4.
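To make the footnotes concrete, here is a small illustrative sketch of how the two widths are used (the URL and key below are made up; the zero-padding mirrors what `create_dataloaders` in `train_decoder.py` does):
```python
webdataset_base_url = "pipe:s3cmd get s3://bucket/path/{}.tar -"  # {} is the shard placeholder
shard_width = 5   # shards named like 00104.tar
index_width = 4   # sample keys like 001045945.jpg -> shard 00104, index 5945

shard_number = 104
shard_url = webdataset_base_url.format(str(shard_number).zfill(shard_width))
# -> "pipe:s3cmd get s3://bucket/path/00104.tar -"

sample_key = "001045945"
shard_part, index_part = sample_key[:-index_width], sample_key[-index_width:]
# -> ("00104", "5945")
```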
**<ins>Train</ins>:**
Settings for controlling the training hyperparameters.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `epochs` | No | `20` | The number of epochs in the training run. |
| `lr` | No | `1e-4` | The learning rate. |
| `wd` | No | `0.01` | The weight decay. |
| `max_grad_norm`| No | `0.5` | The grad norm clipping. |
| `save_every_n_samples` | No | `100000` | Samples will be generated and a checkpoint will be saved every `save_every_n_samples` samples. |
| `device` | No | `cuda:0` | The device to train on. |
| `epoch_samples` | No | `None` | Limits the number of samples iterated through in each epoch. This must be set if resampling. None means no limit. |
| `validation_samples` | No | `None` | The number of samples to use for validation. None means the entire validation set. |
| `use_ema` | No | `True` | Whether to use exponential moving average models for sampling. |
| `ema_beta` | No | `0.99` | The ema coefficient. |
| `save_all` | No | `False` | If True, preserves a checkpoint for every epoch. |
| `save_latest` | No | `True` | If True, overwrites the `latest.pth` every time the model is saved. |
| `save_best` | No | `True` | If True, overwrites the `best.pth` every time the model has a lower validation loss than all previous models. |
| `unet_training_mask` | No | `None` | A boolean array of the same length as the number of unets. If false, the unet is frozen. A value of `None` trains all unets. |
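To illustrate `unet_training_mask`, the training loop in `train_decoder.py` skips frozen unets roughly as follows (simplified sketch; `trainer`, `img`, and `emb` are assumed to already exist):
```python
unet_training_mask = [True, False]  # train the first unet, freeze the second

for unet_number, is_trainable in enumerate(unet_training_mask, start = 1):
    if not is_trainable:
        continue
    loss = trainer.forward(img, image_embed = emb, unet_number = unet_number)
    trainer.update(unet_number = unet_number)
```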
**<ins>Evaluate</ins>:**
Defines which evaluation metrics will be used to test the model.
Each metric can be enabled by setting its configuration. The configuration keys for each metric are those of the torchmetrics constructors linked in the table below.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `n_evaluation_samples` | No | `1000` | The number of samples to generate to test the model. |
| `FID` | No | `None` | Setting to an object enables the [Frechet Inception Distance](https://torchmetrics.readthedocs.io/en/stable/image/frechet_inception_distance.html) metric. |
| `IS` | No | `None` | Setting to an object enables the [Inception Score](https://torchmetrics.readthedocs.io/en/stable/image/inception_score.html) metric. |
| `KID` | No | `None` | Setting to an object enables the [Kernel Inception Distance](https://torchmetrics.readthedocs.io/en/stable/image/kernel_inception_distance.html) metric. |
| `LPIPS` | No | `None` | Setting to an object enables the [Learned Perceptual Image Patch Similarity](https://torchmetrics.readthedocs.io/en/stable/image/learned_perceptual_image_patch_similarity.html) metric. |
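Each enabled metric's object is unpacked directly into the corresponding torchmetrics constructor (this mirrors `evaluate_trainer` in `train_decoder.py`; the `feature` value and the variable names below are illustrative):
```python
from torchmetrics.image.fid import FrechetInceptionDistance

fid_config = {"feature": 64}                  # the "FID" object from the config file
fid = FrechetInceptionDistance(**fid_config)
# fid.update(real_images_uint8, real = True)
# fid.update(generated_images_uint8, real = False)
# fid.compute()
```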
**<ins>Tracker</ins>:**
Selects which tracker to use and configures it.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `tracker_type` | No | `console` | Which tracker to use. Currently accepts `console` or `wandb`. |
| `data_path` | No | `./models` | Where the tracker will store local data. |
| `verbose` | No | `False` | Enables console logging for non-console trackers. |
Other configuration options are required for the specific trackers. To see which are required, reference the initializer parameters of each [tracker](../dalle2_pytorch/trackers.py).
**<ins>Load</ins>:**
Selects where to load a pretrained model from.
| Option | Required | Default | Description |
| ------ | -------- | ------- | ----------- |
| `source` | No | `None` | Supports `file` or `wandb`. |
| `resume` | No | `False` | If the tracker supports resuming the run, resume it. |
Other configuration options are required for loading from a specific source. To see which are required, reference the load methods at the top of the [tracker file](../dalle2_pytorch/trackers.py).
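Once a configuration file is written, it is consumed through the pydantic config classes; here is a minimal sketch of what `train_decoder.py` does when given `--config_file` (the path below points at the example config in this folder):
```python
from dalle2_pytorch.train_configs import TrainDecoderConfig

config = TrainDecoderConfig.from_json_path("configs/train_decoder_config.example.json")

decoder = config.decoder.create()       # builds the Unet(s) and the Decoder from the config
dataloader_kwargs = config.data.dict()  # handed on to dataloader creation
```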

configs/train_decoder_config.example.json (new file)

@@ -0,0 +1,99 @@
{
"decoder": {
"unets": [
{
"dim": 128,
"image_embed_dim": 768,
"cond_dim": 64,
"channels": 3,
"dim_mults": [1, 2, 4, 8],
"attn_dim_head": 32,
"attn_heads": 16
}
],
"image_sizes": [64],
"channels": 3,
"timesteps": 1000,
"loss_type": "l2",
"beta_schedule": "cosine",
"learned_variance": true
},
"data": {
"webdataset_base_url": "pipe:s3cmd get s3://bucket/path/{}.tar -",
"embeddings_url": "s3://bucket/embeddings/path/",
"num_workers": 4,
"batch_size": 64,
"start_shard": 0,
"end_shard": 9999999,
"shard_width": 6,
"index_width": 4,
"splits": {
"train": 0.75,
"val": 0.15,
"test": 0.1
},
"shuffle_train": true,
"resample_train": false,
"preprocessing": {
"RandomResizedCrop": {
"size": [128, 128],
"scale": [0.75, 1.0],
"ratio": [1.0, 1.0]
},
"ToTensor": true
}
},
"train": {
"epochs": 20,
"lr": 1e-4,
"wd": 0.01,
"max_grad_norm": 0.5,
"save_every_n_samples": 100000,
"n_sample_images": 6,
"device": "cuda:0",
"epoch_samples": null,
"validation_samples": null,
"use_ema": true,
"ema_beta": 0.99,
"amp": false,
"save_all": false,
"save_latest": true,
"save_best": true,
"unet_training_mask": [true]
},
"evaluate": {
"n_evaluation_samples": 1000,
"FID": {
"feature": 64
},
"IS": {
"feature": 64,
"splits": 10
},
"KID": {
"feature": 64,
"subset_size": 10
},
"LPIPS": {
"net_type": "vgg",
"reduction": "mean"
}
},
"tracker": {
"tracker_type": "console",
"data_path": "./models",
"wandb_entity": "",
"wandb_project": "",
"verbose": false
},
"load": {
"source": null,
"run_path": "",
"file_path": "",
"resume": false
}
}

dalle2_pytorch/dalle2_pytorch.py

@@ -59,6 +59,9 @@ def default(val, d):
return d() if isfunction(d) else d
def cast_tuple(val, length = 1):
if isinstance(val, list):
val = tuple(val)
return val if isinstance(val, tuple) else ((val,) * length)
def module_device(module):
@@ -887,6 +890,8 @@ class DiffusionPrior(BaseGaussianDiffusion):
)
if exists(clip):
assert image_channels == clip.image_channels, f'channels of image ({image_channels}) should be equal to the channels that CLIP accepts ({clip.image_channels})'
if isinstance(clip, CLIP):
clip = XClipAdapter(clip, **clip_adapter_overrides)
elif isinstance(clip, CoCa):
@@ -1699,6 +1704,8 @@ class Decoder(BaseGaussianDiffusion):
vb_loss_weight = 0.001,
unconditional = False,
auto_normalize_img = True, # whether to take care of normalizing the image from [0, 1] to [-1, 1] and back automatically - you can turn this off if you want to pass in the [-1, 1] ranged image yourself from the dataloader
use_dynamic_thres = False, # from the Imagen paper
dynamic_thres_percentile = 0.9
):
super().__init__(
beta_schedule = beta_schedule,
@@ -1707,12 +1714,19 @@ class Decoder(BaseGaussianDiffusion):
)
self.unconditional = unconditional
assert not (condition_on_text_encodings and unconditional), 'unconditional decoder image generation cannot be set to True if conditioning on text is present'
assert self.unconditional or (exists(clip) ^ exists(image_size)), 'either CLIP is supplied, or you must give the image_size and channels (usually 3 for RGB)'
# text conditioning
assert not (condition_on_text_encodings and unconditional), 'unconditional decoder image generation cannot be set to True if conditioning on text is present'
self.condition_on_text_encodings = condition_on_text_encodings
# clip
self.clip = None
if exists(clip):
assert not unconditional, 'clip must not be given if doing unconditional image training'
assert channels == clip.image_channels, f'channels of image ({channels}) should be equal to the channels that CLIP accepts ({clip.image_channels})'
if isinstance(clip, CLIP):
clip = XClipAdapter(clip, **clip_adapter_overrides)
elif isinstance(clip, CoCa):
@@ -1722,13 +1736,20 @@ class Decoder(BaseGaussianDiffusion):
assert isinstance(clip, BaseClipAdapter)
self.clip = clip
self.clip_image_size = clip.image_size
self.channels = clip.image_channels
else:
self.clip_image_size = image_size
self.channels = channels
self.condition_on_text_encodings = condition_on_text_encodings
# determine image size, with image_size and image_sizes taking precedence
if exists(image_size) or exists(image_sizes):
assert exists(image_size) ^ exists(image_sizes), 'only one of image_size or image_sizes must be given'
image_size = default(image_size, lambda: image_sizes[-1])
elif exists(clip):
image_size = clip.image_size
else:
raise ValueError('either image_size, image_sizes, or clip must be given to decoder')
# channels
self.channels = channels
# automatically take care of ensuring that first unet is unconditional
# while the rest of the unets are conditioned on the low resolution image produced by previous unet
@@ -1770,7 +1791,7 @@ class Decoder(BaseGaussianDiffusion):
# unet image sizes
image_sizes = default(image_sizes, (self.clip_image_size,))
image_sizes = default(image_sizes, (image_size,))
image_sizes = tuple(sorted(set(image_sizes)))
assert len(self.unets) == len(image_sizes), f'you did not supply the correct number of u-nets ({len(self.unets)}) for resolutions {image_sizes}'
@@ -1807,7 +1828,13 @@ class Decoder(BaseGaussianDiffusion):
self.clip_denoised = clip_denoised
self.clip_x_start = clip_x_start
# dynamic thresholding settings, if clipping denoised during sampling
self.use_dynamic_thres = use_dynamic_thres
self.dynamic_thres_percentile = dynamic_thres_percentile
# normalize and unnormalize image functions
self.normalize_img = normalize_neg_one_to_one if auto_normalize_img else identity
self.unnormalize_img = unnormalize_zero_to_one if auto_normalize_img else identity
@@ -1848,7 +1875,21 @@ class Decoder(BaseGaussianDiffusion):
x_recon = self.predict_start_from_noise(x, t = t, noise = pred)
if clip_denoised:
x_recon.clamp_(-1., 1.)
# s is the threshold amount
# static thresholding would just be s = 1
s = 1.
if self.use_dynamic_thres:
s = torch.quantile(
rearrange(x_recon, 'b ... -> b (...)').abs(),
self.dynamic_thres_percentile,
dim = -1
)
s.clamp_(min = 1.)
s = s.view(-1, *((1,) * (x_recon.ndim - 1)))
# clip by threshold, depending on whether static or dynamic
x_recon = x_recon.clamp(-s, s) / s
model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)

dalle2_pytorch/optimizer.py

@@ -11,7 +11,8 @@ def get_optimizer(
wd = 1e-2,
betas = (0.9, 0.999),
eps = 1e-8,
filter_by_requires_grad = False
filter_by_requires_grad = False,
**kwargs
):
if filter_by_requires_grad:
params = list(filter(lambda t: t.requires_grad, params))

dalle2_pytorch/trackers.py

@@ -1,20 +1,32 @@
import os
from pathlib import Path
import importlib
from itertools import zip_longest
from enum import Enum
import torch
from torch import nn
# constants
DEFAULT_DATA_PATH = './.tracker-data'
# helper functions
def exists(val):
return val is not None
def load_wandb_state_dict(run_path, file_path, **kwargs):
def import_or_print_error(pkg_name, err_str = None):
try:
import wandb
except ImportError as e:
print('`pip install wandb` to use the wandb recall function')
raise e
return importlib.import_module(pkg_name)
except ModuleNotFoundError as e:
if exists(err_str):
print(err_str)
exit()
# load state dict functions
def load_wandb_state_dict(run_path, file_path, **kwargs):
wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb recall function')
file_reference = wandb.restore(file_path, run_path=run_path)
return torch.load(file_reference.name)
@@ -24,11 +36,10 @@ def load_local_state_dict(file_path, **kwargs):
# base class
class BaseTracker(nn.Module):
def __init__(self, data_path):
def __init__(self, data_path = DEFAULT_DATA_PATH):
super().__init__()
assert data_path is not None, "Tracker must have a data_path to save local content"
self.data_path = os.path.abspath(data_path)
os.makedirs(self.data_path, exist_ok=True)
self.data_path = Path(data_path)
self.data_path.mkdir(parents = True, exist_ok = True)
def init(self, config, **kwargs):
raise NotImplementedError
@@ -66,28 +77,19 @@ class ConsoleTracker(BaseTracker):
def log(self, log, **kwargs):
print(log)
def log_images(self, images, **kwargs):
"""
Currently, do nothing with console logged images
"""
def log_images(self, images, **kwargs): # noop for logging images
pass
def save_state_dict(self, state_dict, relative_path, **kwargs):
torch.save(state_dict, os.path.join(self.data_path, relative_path))
torch.save(state_dict, str(self.data_path / relative_path))
# basic wandb class
class WandbTracker(BaseTracker):
def __init__(self, data_path):
super().__init__(data_path)
try:
import wandb
except ImportError as e:
print('`pip install wandb` to use the wandb experiment tracker')
raise e
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb experiment tracker')
os.environ["WANDB_SILENT"] = "true"
self.wandb = wandb
def init(self, **config):
self.wandb.init(**config)
@@ -108,6 +110,6 @@ class WandbTracker(BaseTracker):
"""
Saves a state_dict to disk and uploads it
"""
full_path = os.path.join(self.data_path, relative_path)
full_path = str(self.data_path / relative_path)
torch.save(state_dict, full_path)
self.wandb.save(full_path, base_path=self.data_path) # Upload and keep relative to data_path
self.wandb.save(full_path, base_path = str(self.data_path)) # Upload and keep relative to data_path

dalle2_pytorch/train_configs.py (new file)

@@ -0,0 +1,189 @@
import json
from torchvision import transforms as T
from pydantic import BaseModel, validator, root_validator
from typing import List, Iterable, Optional, Union, Tuple, Dict, Any
from dalle2_pytorch.dalle2_pytorch import Unet, Decoder, DiffusionPrior, DiffusionPriorNetwork
# helper functions
def exists(val):
return val is not None
def default(val, d):
return val if exists(val) else d
def ListOrTuple(inner_type):
return Union[List[inner_type], Tuple[inner_type]]
# pydantic classes
class DiffusionPriorNetworkConfig(BaseModel):
dim: int
depth: int
num_timesteps: int = None
num_time_embeds: int = 1
num_image_embeds: int = 1
num_text_embeds: int = 1
dim_head: int = 64
heads: int = 8
ff_mult: int = 4
norm_out: bool = True
attn_dropout: float = 0.
ff_dropout: float = 0.
final_proj: bool = True
normformer: bool = False
rotary_emb: bool = True
class DiffusionPriorConfig(BaseModel):
# only clip-less diffusion prior config for now
net: DiffusionPriorNetworkConfig
image_embed_dim: int
image_size: int
image_channels: int = 3
timesteps: int = 1000
cond_drop_prob: float = 0.
loss_type: str = 'l2'
predict_x_start: bool = True
beta_schedule: str = 'cosine'
def create(self):
kwargs = self.dict()
diffusion_prior_network = DiffusionPriorNetwork(**kwargs.pop('net'))
return DiffusionPrior(net = diffusion_prior_network, **kwargs)
class Config:
extra = "allow"
class UnetConfig(BaseModel):
dim: int
dim_mults: ListOrTuple(int)
image_embed_dim: int = None
cond_dim: int = None
channels: int = 3
attn_dim_head: int = 32
attn_heads: int = 16
class Config:
extra = "allow"
class DecoderConfig(BaseModel):
unets: ListOrTuple(UnetConfig)
image_size: int = None
image_sizes: ListOrTuple(int) = None
channels: int = 3
timesteps: int = 1000
loss_type: str = 'l2'
beta_schedule: str = 'cosine'
learned_variance: bool = True
image_cond_drop_prob: float = 0.1
text_cond_drop_prob: float = 0.5
def create(self):
decoder_kwargs = self.dict()
unet_configs = decoder_kwargs.pop('unets')
unets = [Unet(**config) for config in unet_configs]
return Decoder(unets, **decoder_kwargs)
@validator('image_sizes')
def check_image_sizes(cls, image_sizes, values):
if exists(values.get('image_size')) ^ exists(image_sizes):
return image_sizes
raise ValueError('either image_size or image_sizes is required, but not both')
class Config:
extra = "allow"
class TrainSplitConfig(BaseModel):
train: float = 0.75
val: float = 0.15
test: float = 0.1
@root_validator
def validate_all(cls, fields):
if sum([*fields.values()]) != 1.:
raise ValueError(f'{fields.keys()} must sum to 1.0')
return fields
class DecoderDataConfig(BaseModel):
webdataset_base_url: str # path to a webdataset with jpg images
embeddings_url: str # path to .npy files with embeddings
num_workers: int = 4
batch_size: int = 64
start_shard: int = 0
end_shard: int = 9999999
shard_width: int = 6
index_width: int = 4
splits: TrainSplitConfig
shuffle_train: bool = True
resample_train: bool = False
preprocessing: Dict[str, Any] = {'ToTensor': True}
@property
def img_preproc(self):
def _get_transformation(transformation_name, **kwargs):
if transformation_name == "RandomResizedCrop":
return T.RandomResizedCrop(**kwargs)
elif transformation_name == "RandomHorizontalFlip":
return T.RandomHorizontalFlip()
elif transformation_name == "ToTensor":
return T.ToTensor()
transforms = []
for transform_name, transform_kwargs_or_bool in self.preprocessing.items():
transform_kwargs = {} if not isinstance(transform_kwargs_or_bool, dict) else transform_kwargs_or_bool
transforms.append(_get_transformation(transform_name, **transform_kwargs))
return T.Compose(transforms)
class DecoderTrainConfig(BaseModel):
epochs: int = 20
lr: float = 1e-4
wd: float = 0.01
max_grad_norm: float = 0.5
save_every_n_samples: int = 100000
n_sample_images: int = 6 # The number of example images to produce when sampling the train and test dataset
device: str = 'cuda:0'
epoch_samples: int = None # Limits the number of samples per epoch. None means no limit. Required if resample_train is true as otherwise the number of samples per epoch is infinite.
validation_samples: int = None # Same as above but for validation.
use_ema: bool = True
ema_beta: float = 0.99
amp: bool = False
save_all: bool = False # Whether to preserve all checkpoints
save_latest: bool = True # Whether to always save the latest checkpoint
save_best: bool = True # Whether to save the best checkpoint
unet_training_mask: ListOrTuple(bool) = None # If None, use all unets
class DecoderEvaluateConfig(BaseModel):
n_evaluation_samples: int = 1000
FID: Dict[str, Any] = None
IS: Dict[str, Any] = None
KID: Dict[str, Any] = None
LPIPS: Dict[str, Any] = None
class TrackerConfig(BaseModel):
tracker_type: str = 'console' # Decoder currently supports console and wandb
data_path: str = './models' # The path where files will be saved locally
init_config: Dict[str, Any] = None
wandb_entity: str = '' # Only needs to be set if tracker_type is wandb
wandb_project: str = ''
verbose: bool = False # Whether to print console logging for non-console trackers
class DecoderLoadConfig(BaseModel):
source: str = None # Supports file and wandb
run_path: str = '' # Used only if source is wandb
file_path: str = '' # The local filepath if source is file. If source is wandb, the relative path to the model file in wandb.
resume: bool = False # If using wandb, whether to resume the run
class TrainDecoderConfig(BaseModel):
decoder: DecoderConfig
data: DecoderDataConfig
train: DecoderTrainConfig
evaluate: DecoderEvaluateConfig
tracker: TrackerConfig
load: DecoderLoadConfig
@classmethod
def from_json_path(cls, json_path):
with open(json_path) as f:
config = json.load(f)
return cls(**config)
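As a usage sketch of the config-driven, clip-less diffusion prior creation defined above (a minimal example only; the dimensions are assumptions, not recommendations):
```python
from dalle2_pytorch.train_configs import DiffusionPriorConfig

prior_config = DiffusionPriorConfig(
    net = dict(dim = 512, depth = 6),   # coerced into a DiffusionPriorNetworkConfig
    image_embed_dim = 512,
    image_size = 256
)

diffusion_prior = prior_config.create()  # builds the DiffusionPriorNetwork and the DiffusionPrior
```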

dalle2_pytorch/trainer.py

@@ -1,5 +1,6 @@
import time
import copy
from pathlib import Path
from math import ceil
from functools import partial, wraps
from collections.abc import Iterable
@@ -55,6 +56,10 @@ def num_to_groups(num, divisor):
arr.append(remainder)
return arr
def get_pkg_version():
from pkg_resources import get_distribution
return get_distribution('dalle2_pytorch').version
# decorators
def cast_torch_tensor(fn):
@@ -128,12 +133,6 @@ def split_args_and_kwargs(*args, split_size = None, **kwargs):
chunk_size_frac = chunk_size / batch_size
yield chunk_size_frac, (chunked_args, chunked_kwargs)
# print helpers
def print_ribbon(s, symbol = '=', repeat = 40):
flank = symbol * repeat
return f'{flank} {s} {flank}'
# saving and loading functions
# for diffusion prior
@@ -191,7 +190,7 @@ class EMA(nn.Module):
self.update_after_step = update_after_step // update_every # only start EMA after this step number, starting at 0
self.register_buffer('initted', torch.Tensor([False]))
self.register_buffer('step', torch.tensor([0.]))
self.register_buffer('step', torch.tensor([0]))
def restore_ema_model_device(self):
device = self.initted.device
@@ -287,7 +286,50 @@ class DiffusionPriorTrainer(nn.Module):
self.max_grad_norm = max_grad_norm
self.register_buffer('step', torch.tensor([0.]))
self.register_buffer('step', torch.tensor([0]))
def save(self, path, overwrite = True, **kwargs):
path = Path(path)
assert not (path.exists() and not overwrite)
path.parent.mkdir(parents = True, exist_ok = True)
save_obj = dict(
scaler = self.scaler.state_dict(),
optimizer = self.optimizer.state_dict(),
model = self.diffusion_prior.state_dict(),
version = get_pkg_version(),
step = self.step.item(),
**kwargs
)
if self.use_ema:
save_obj = {**save_obj, 'ema': self.ema_diffusion_prior.state_dict()}
torch.save(save_obj, str(path))
def load(self, path, only_model = False, strict = True):
path = Path(path)
assert path.exists()
loaded_obj = torch.load(str(path))
if get_pkg_version() != loaded_obj['version']:
print(f'loading saved diffusion prior at version {loaded_obj["version"]} but current package version is at {get_pkg_version()}')
self.diffusion_prior.load_state_dict(loaded_obj['model'], strict = strict)
self.step.copy_(torch.ones_like(self.step) * loaded_obj['step'])
if only_model:
return loaded_obj
self.scaler.load_state_dict(loaded_obj['scaler'])
self.optimizer.load_state_dict(loaded_obj['optimizer'])
if self.use_ema:
assert 'ema' in loaded_obj
self.ema_diffusion_prior.load_state_dict(loaded_obj['ema'], strict = strict)
return loaded_obj
def update(self):
if exists(self.max_grad_norm):
@@ -410,6 +452,60 @@ class DecoderTrainer(nn.Module):
self.register_buffer('step', torch.tensor([0.]))
def save(self, path, overwrite = True, **kwargs):
path = Path(path)
assert not (path.exists() and not overwrite)
path.parent.mkdir(parents = True, exist_ok = True)
save_obj = dict(
model = self.decoder.state_dict(),
version = get_pkg_version(),
step = self.step.item(),
**kwargs
)
for ind in range(0, self.num_unets):
scaler_key = f'scaler{ind}'
optimizer_key = f'optim{ind}'
scaler = getattr(self, scaler_key)
optimizer = getattr(self, optimizer_key)
save_obj = {**save_obj, scaler_key: scaler.state_dict(), optimizer_key: optimizer.state_dict()}
if self.use_ema:
save_obj = {**save_obj, 'ema': self.ema_unets.state_dict()}
torch.save(save_obj, str(path))
def load(self, path, only_model = False, strict = True):
path = Path(path)
assert path.exists()
loaded_obj = torch.load(str(path))
if get_pkg_version() != loaded_obj['version']:
print(f'loading saved decoder at version {loaded_obj["version"]}, but current package version is {get_pkg_version()}')
self.decoder.load_state_dict(loaded_obj['model'], strict = strict)
self.step.copy_(torch.ones_like(self.step) * loaded_obj['step'])
if only_model:
return loaded_obj
for ind in range(0, self.num_unets):
scaler_key = f'scaler{ind}'
optimizer_key = f'optim{ind}'
scaler = getattr(self, scaler_key)
optimizer = getattr(self, optimizer_key)
scaler.load_state_dict(loaded_obj[scaler_key])
optimizer.load_state_dict(loaded_obj[optimizer_key])
if self.use_ema:
assert 'ema' in loaded_obj
self.ema_unets.load_state_dict(loaded_obj['ema'], strict = strict)
return loaded_obj
@property
def unets(self):
return nn.ModuleList([ema.ema_model for ema in self.ema_unets])

dalle2_pytorch/utils.py (new file, 19 lines)

@@ -0,0 +1,19 @@
import time
# time helpers
class Timer:
def __init__(self):
self.reset()
def reset(self):
self.last_time = time.time()
def elapsed(self):
return time.time() - self.last_time
# print helpers
def print_ribbon(s, symbol = '=', repeat = 40):
flank = symbol * repeat
return f'{flank} {s} {flank}'
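A small usage sketch for these two helpers, mirroring how `train_decoder.py` uses them (the batch size is illustrative):
```python
from dalle2_pytorch.utils import Timer, print_ribbon

print(print_ribbon("Generating Example Data", repeat = 40))

timer = Timer()
batch_size = 64
# ... process one batch here ...
samples_per_sec = batch_size / timer.elapsed()
timer.reset()
```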

setup.py

@@ -10,7 +10,7 @@ setup(
'dream = dalle2_pytorch.cli:dream'
],
},
version = '0.3.3',
version = '0.5.0',
license='MIT',
description = 'DALL-E 2',
author = 'Phil Wang',
@@ -32,6 +32,7 @@ setup(
'kornia>=0.5.4',
'numpy',
'pillow',
'pydantic',
'resize-right>=0.0.2',
'rotary-embedding-torch',
'torch>=1.10',
@@ -41,7 +42,8 @@ setup(
'x-clip>=0.4.4',
'youtokentome',
'webdataset>=0.2.5',
'fsspec>=2022.1.0'
'fsspec>=2022.1.0',
'torchmetrics[image]>=0.8.0'
],
classifiers=[
'Development Status :: 4 - Beta',

train_decoder.py (new file, 442 lines)

@@ -0,0 +1,442 @@
from dalle2_pytorch import Unet, Decoder
from dalle2_pytorch.trainer import DecoderTrainer
from dalle2_pytorch.dataloaders import create_image_embedding_dataloader
from dalle2_pytorch.trackers import WandbTracker, ConsoleTracker
from dalle2_pytorch.train_configs import TrainDecoderConfig
from dalle2_pytorch.utils import Timer, print_ribbon
import torchvision
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
import webdataset as wds
import click
# constants
TRAIN_CALC_LOSS_EVERY_ITERS = 10
VALID_CALC_LOSS_EVERY_ITERS = 10
# helper functions
def exists(val):
return val is not None
# main functions
def create_dataloaders(
available_shards,
webdataset_base_url,
embeddings_url,
shard_width=6,
num_workers=4,
batch_size=32,
n_sample_images=6,
shuffle_train=True,
resample_train=False,
img_preproc = None,
index_width=4,
train_prop = 0.75,
val_prop = 0.15,
test_prop = 0.10,
**kwargs
):
"""
Randomly splits the available shards into train, val, and test sets and returns a dataloader for each
"""
assert train_prop + test_prop + val_prop == 1
num_train = round(train_prop*len(available_shards))
num_test = round(test_prop*len(available_shards))
num_val = len(available_shards) - num_train - num_test
assert num_train + num_test + num_val == len(available_shards), f"{num_train} + {num_test} + {num_val} = {num_train + num_test + num_val} != {len(available_shards)}"
train_split, test_split, val_split = torch.utils.data.random_split(available_shards, [num_train, num_test, num_val], generator=torch.Generator().manual_seed(0))
# The shard number in the webdataset file names has a fixed width. We zero pad the shard numbers so they correspond to a filename.
train_urls = [webdataset_base_url.format(str(shard).zfill(shard_width)) for shard in train_split]
test_urls = [webdataset_base_url.format(str(shard).zfill(shard_width)) for shard in test_split]
val_urls = [webdataset_base_url.format(str(shard).zfill(shard_width)) for shard in val_split]
create_dataloader = lambda tar_urls, shuffle=False, resample=False, with_text=False, for_sampling=False: create_image_embedding_dataloader(
tar_url=tar_urls,
num_workers=num_workers,
batch_size=batch_size if not for_sampling else n_sample_images,
embeddings_url=embeddings_url,
index_width=index_width,
shuffle_num = None,
extra_keys= ["txt"] if with_text else [],
shuffle_shards = shuffle,
resample_shards = resample,
img_preproc=img_preproc,
handler=wds.handlers.warn_and_continue
)
train_dataloader = create_dataloader(train_urls, shuffle=shuffle_train, resample=resample_train)
train_sampling_dataloader = create_dataloader(train_urls, shuffle=False, for_sampling=True)
val_dataloader = create_dataloader(val_urls, shuffle=False, with_text=True)
test_dataloader = create_dataloader(test_urls, shuffle=False, with_text=True)
test_sampling_dataloader = create_dataloader(test_urls, shuffle=False, for_sampling=True)
return {
"train": train_dataloader,
"train_sampling": train_sampling_dataloader,
"val": val_dataloader,
"test": test_dataloader,
"test_sampling": test_sampling_dataloader
}
def get_dataset_keys(dataloader):
"""
It is sometimes necessary to get the keys the dataloader is returning. Since the dataset is buried in the dataloader, we need to do a process to recover it.
"""
# If the dataloader is actually a WebLoader, we need to extract the real dataloader
if isinstance(dataloader, wds.WebLoader):
dataloader = dataloader.pipeline[0]
return dataloader.dataset.key_map
def get_example_data(dataloader, device, n=5):
"""
Samples the dataloader and returns a zipped list of examples
"""
images = []
embeddings = []
captions = []
dataset_keys = get_dataset_keys(dataloader)
has_caption = "txt" in dataset_keys
for data in dataloader:
if has_caption:
img, emb, txt = data
else:
img, emb = data
txt = [""] * emb.shape[0]
img = img.to(device=device, dtype=torch.float)
emb = emb.to(device=device, dtype=torch.float)
images.extend(list(img))
embeddings.extend(list(emb))
captions.extend(list(txt))
if len(images) >= n:
break
print("Generated {} examples".format(len(images)))
return list(zip(images[:n], embeddings[:n], captions[:n]))
def generate_samples(trainer, example_data, text_prepend=""):
"""
Takes example data and generates images from the embeddings
Returns three lists: real images, generated images, and captions
"""
real_images, embeddings, txts = zip(*example_data)
embeddings_tensor = torch.stack(embeddings)
samples = trainer.sample(embeddings_tensor)
generated_images = list(samples)
captions = [text_prepend + txt for txt in txts]
return real_images, generated_images, captions
def generate_grid_samples(trainer, examples, text_prepend=""):
"""
Generates samples and uses torchvision to put them in a side by side grid for easy viewing
"""
real_images, generated_images, captions = generate_samples(trainer, examples, text_prepend)
grid_images = [torchvision.utils.make_grid([original_image, generated_image]) for original_image, generated_image in zip(real_images, generated_images)]
return grid_images, captions
def evaluate_trainer(trainer, dataloader, device, n_evaluation_samples=1000, FID=None, IS=None, KID=None, LPIPS=None):
"""
Computes evaluation metrics for the decoder
"""
metrics = {}
# Prepare the data
examples = get_example_data(dataloader, device, n_evaluation_samples)
real_images, generated_images, captions = generate_samples(trainer, examples)
real_images = torch.stack(real_images).to(device=device, dtype=torch.float)
generated_images = torch.stack(generated_images).to(device=device, dtype=torch.float)
# Convert from [0, 1] to [0, 255] and from torch.float to torch.uint8
int_real_images = real_images.mul(255).add(0.5).clamp(0, 255).type(torch.uint8)
int_generated_images = generated_images.mul(255).add(0.5).clamp(0, 255).type(torch.uint8)
if exists(FID):
fid = FrechetInceptionDistance(**FID)
fid.to(device=device)
fid.update(int_real_images, real=True)
fid.update(int_generated_images, real=False)
metrics["FID"] = fid.compute().item()
if exists(IS):
inception = InceptionScore(**IS)
inception.to(device=device)
inception.update(int_real_images)
is_mean, is_std = inception.compute()
metrics["IS_mean"] = is_mean.item()
metrics["IS_std"] = is_std.item()
if exists(KID):
kernel_inception = KernelInceptionDistance(**KID)
kernel_inception.to(device=device)
kernel_inception.update(int_real_images, real=True)
kernel_inception.update(int_generated_images, real=False)
kid_mean, kid_std = kernel_inception.compute()
metrics["KID_mean"] = kid_mean.item()
metrics["KID_std"] = kid_std.item()
if exists(LPIPS):
# Convert from [0, 1] to [-1, 1]
renorm_real_images = real_images.mul(2).sub(1)
renorm_generated_images = generated_images.mul(2).sub(1)
lpips = LearnedPerceptualImagePatchSimilarity(**LPIPS)
lpips.to(device=device)
lpips.update(renorm_real_images, renorm_generated_images)
metrics["LPIPS"] = lpips.compute().item()
return metrics
def save_trainer(tracker, trainer, epoch, step, validation_losses, relative_paths):
"""
Logs the model with an appropriate method depending on the tracker
"""
if isinstance(relative_paths, str):
relative_paths = [relative_paths]
trainer_state_dict = {}
trainer_state_dict["trainer"] = trainer.state_dict()
trainer_state_dict['epoch'] = epoch
trainer_state_dict['step'] = step
trainer_state_dict['validation_losses'] = validation_losses
for relative_path in relative_paths:
tracker.save_state_dict(trainer_state_dict, relative_path)
def recall_trainer(tracker, trainer, recall_source=None, **load_config):
"""
Loads the model with an appropriate method depending on the tracker
"""
print(print_ribbon(f"Loading model from {recall_source}"))
state_dict = tracker.recall_state_dict(recall_source, **load_config)
trainer.load_state_dict(state_dict["trainer"])
print("Model loaded")
return state_dict["epoch"], state_dict["step"], state_dict["validation_losses"]
def train(
dataloaders,
decoder,
tracker,
inference_device,
load_config=None,
evaluate_config=None,
epoch_samples = None, # If the training dataset is resampling, we have to manually stop an epoch
validation_samples = None,
epochs = 20,
n_sample_images = 5,
save_every_n_samples = 100000,
save_all=False,
save_latest=True,
save_best=True,
unet_training_mask=None,
**kwargs
):
"""
Trains a decoder on a dataset.
"""
trainer = DecoderTrainer( # TODO: Change the get_optimizer function so that it can take arbitrary named args so we can just put **kwargs as an argument here
decoder,
**kwargs
)
# Set up starting model and parameters based on a recalled state dict
start_step = 0
start_epoch = 0
validation_losses = []
if exists(load_config) and exists(load_config.source):
start_epoch, start_step, validation_losses = recall_trainer(tracker, trainer, recall_source=load_config.source, **load_config)
trainer.to(device=inference_device)
if not exists(unet_training_mask):
# Then the unet mask should be true for all unets in the decoder
unet_training_mask = [True] * trainer.num_unets
assert len(unet_training_mask) == trainer.num_unets, f"The unet training mask should be the same length as the number of unets in the decoder. Got {len(unet_training_mask)} and {trainer.num_unets}"
print(print_ribbon("Generating Example Data", repeat=40))
print("This can take a while to load the shard lists...")
train_example_data = get_example_data(dataloaders["train_sampling"], inference_device, n_sample_images)
test_example_data = get_example_data(dataloaders["test_sampling"], inference_device, n_sample_images)
send_to_device = lambda arr: [x.to(device=inference_device, dtype=torch.float) for x in arr]
step = start_step
for epoch in range(start_epoch, epochs):
print(print_ribbon(f"Starting epoch {epoch}", repeat=40))
timer = Timer()
sample = 0
last_sample = 0
last_snapshot = 0
losses = []
for i, (img, emb) in enumerate(dataloaders["train"]):
step += 1
sample += img.shape[0]
img, emb = send_to_device((img, emb))
trainer.train()
for unet in range(1, trainer.num_unets+1):
# Check if this is a unet we are training
if not unet_training_mask[unet-1]: # Unet index is the unet number - 1
continue
loss = trainer.forward(img, image_embed=emb, unet_number=unet)
trainer.update(unet_number=unet)
losses.append(loss)
samples_per_sec = (sample - last_sample) / timer.elapsed()
timer.reset()
last_sample = sample
if i % TRAIN_CALC_LOSS_EVERY_ITERS == 0:
average_loss = sum(losses) / len(losses)
log_data = {
"Training loss": average_loss,
"Epoch": epoch,
"Sample": sample,
"Step": i,
"Samples per second": samples_per_sec
}
tracker.log(log_data, step=step, verbose=True)
losses = []
if last_snapshot + save_every_n_samples < sample: # This will miss by some amount every time, but it's not a big deal... I hope
last_snapshot = sample
# We need to know where the model should be saved
save_paths = []
if save_latest:
save_paths.append("latest.pth")
if save_all:
save_paths.append(f"checkpoints/epoch_{epoch}_step_{step}.pth")
save_trainer(tracker, trainer, epoch, step, validation_losses, save_paths)
if exists(n_sample_images) and n_sample_images > 0:
trainer.eval()
train_images, train_captions = generate_grid_samples(trainer, train_example_data, "Train: ")
tracker.log_images(train_images, captions=train_captions, image_section="Train Samples", step=step)
if exists(epoch_samples) and sample >= epoch_samples:
break
trainer.eval()
print(print_ribbon(f"Starting Validation {epoch}", repeat=40))
with torch.no_grad():
sample = 0
average_loss = 0
timer = Timer()
for i, (img, emb, txt) in enumerate(dataloaders["val"]):
sample += img.shape[0]
img, emb = send_to_device((img, emb))
for unet in range(1, len(decoder.unets)+1):
loss = trainer.forward(img.float(), image_embed=emb.float(), unet_number=unet)
average_loss += loss
if i % VALID_CALC_LOSS_EVERY_ITERS == 0:
print(f"Epoch {epoch}/{epochs} - {sample / timer.elapsed():.2f} samples/sec")
print(f"Loss: {average_loss / (i+1)}")
print("")
if exists(validation_samples) and sample >= validation_samples:
break
average_loss /= i+1
log_data = {
"Validation loss": average_loss
}
tracker.log(log_data, step=step, verbose=True)
# Compute evaluation metrics
if exists(evaluate_config):
print(print_ribbon(f"Starting Evaluation {epoch}", repeat=40))
evaluation = evaluate_trainer(trainer, dataloaders["val"], inference_device, **evaluate_config)
tracker.log(evaluation, step=step, verbose=True)
# Generate sample images
print(print_ribbon(f"Sampling Set {epoch}", repeat=40))
test_images, test_captions = generate_grid_samples(trainer, test_example_data, "Test: ")
train_images, train_captions = generate_grid_samples(trainer, train_example_data, "Train: ")
tracker.log_images(test_images, captions=test_captions, image_section="Test Samples", step=step)
tracker.log_images(train_images, captions=train_captions, image_section="Train Samples", step=step)
print(print_ribbon(f"Starting Saving {epoch}", repeat=40))
# Get the save paths
save_paths = []
if save_latest:
save_paths.append("latest.pth")
if save_best and (len(validation_losses) == 0 or average_loss < min(validation_losses)):
save_paths.append("best.pth")
validation_losses.append(average_loss)
save_trainer(tracker, trainer, epoch, step, validation_losses, save_paths)
def create_tracker(config, tracker_type=None, data_path=None, **kwargs):
"""
Creates a tracker of the specified type and initializes special features based on the full config
"""
tracker_config = config.tracker
init_config = {}
if exists(tracker_config.init_config):
init_config["config"] = tracker_config.init_config
if tracker_type == "console":
tracker = ConsoleTracker(**init_config)
elif tracker_type == "wandb":
# We need to initialize the resume state here
load_config = config.load
if load_config.source == "wandb" and load_config.resume:
# Then we are resuming the run load_config["run_path"]
run_id = load_config.run_path.split("/")[-1]
init_config["id"] = run_id
init_config["resume"] = "must"
init_config["entity"] = tracker_config.wandb_entity
init_config["project"] = tracker_config.wandb_project
tracker = WandbTracker(data_path)
tracker.init(**init_config)
else:
raise ValueError(f"Tracker type {tracker_type} not supported by decoder trainer")
return tracker
def initialize_training(config):
# Create the save path
if "cuda" in config.train.device:
assert torch.cuda.is_available(), "CUDA is not available"
device = torch.device(config.train.device)
torch.cuda.set_device(device)
all_shards = list(range(config.data.start_shard, config.data.end_shard + 1))
dataloaders = create_dataloaders (
available_shards=all_shards,
img_preproc = config.data.img_preproc,
train_prop = config.data.splits.train,
val_prop = config.data.splits.val,
test_prop = config.data.splits.test,
n_sample_images=config.train.n_sample_images,
**config.data.dict()
)
decoder = config.decoder.create().to(device = device)
num_parameters = sum(p.numel() for p in decoder.parameters())
print(print_ribbon("Loaded Config", repeat=40))
print(f"Number of parameters: {num_parameters}")
tracker = create_tracker(config, **config.tracker.dict())
train(dataloaders, decoder,
tracker=tracker,
inference_device=device,
load_config=config.load,
evaluate_config=config.evaluate,
**config.train.dict(),
)
# Create a simple click command line interface to load the config and start the training
@click.command()
@click.option("--config_file", default="./train_decoder_config.json", help="Path to config file")
def main(config_file):
print("Recalling config from {}".format(config_file))
config = TrainDecoderConfig.from_json_path(config_file)
initialize_training(config)
if __name__ == "__main__":
main()

train_diffusion_prior.py

@@ -1,7 +1,6 @@
from pathlib import Path
import click
import math
import time
import numpy as np
import torch
@@ -10,9 +9,10 @@ from torch import nn
from dalle2_pytorch.dataloaders import make_splits
from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork, OpenAIClipAdapter
from dalle2_pytorch.trainer import DiffusionPriorTrainer, load_diffusion_model, save_diffusion_model, print_ribbon
from dalle2_pytorch.trainer import DiffusionPriorTrainer, load_diffusion_model, save_diffusion_model
from dalle2_pytorch.trackers import ConsoleTracker, WandbTracker
from dalle2_pytorch.utils import Timer, print_ribbon
from embedding_reader import EmbeddingReader
@@ -29,16 +29,6 @@ tracker = WandbTracker()
def exists(val):
return val is not None
class Timer:
def __init__(self):
self.reset()
def reset(self):
self.last_time = time.time()
def elapsed(self):
return time.time() - self.last_time
# functions
def eval_model(model, dataloader, text_conditioned, loss_type, phase="Validation"):