fix self conditioning shape in diffusion prior

make self conditioning technique work with diffusion prior
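
For readers unfamiliar with the technique, here is a minimal sketch of self-conditioning as described in the Analog Bits paper cited in the README changes. It is not the code from this commit: `ToyPriorNet` and `self_cond_training_step` are hypothetical names, the network is a toy MLP rather than the repository's `DiffusionPriorNetwork`, and the 0.5 probability and the choice to predict the clean embedding directly are assumptions made for illustration. The shape point is that the self-conditioning estimate has the same shape as the noised embedding and falls back to zeros when no estimate is available.

```python
import random
import torch
from torch import nn

class ToyPriorNet(nn.Module):
    """Hypothetical stand-in for a prior network (not the repository's class)."""
    def __init__(self, dim):
        super().__init__()
        # input = noised embedding + self-conditioning estimate + scalar timestep
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, dim * 2),
            nn.SiLU(),
            nn.Linear(dim * 2, dim),
        )

    def forward(self, x, t, self_cond=None):
        if self_cond is None:
            # no previous estimate: condition on zeros with the same shape as x
            self_cond = torch.zeros_like(x)
        t = (t.float() / 1000).unsqueeze(-1)  # (batch, 1), crude timestep feature
        return self.net(torch.cat((x, self_cond, t), dim=-1))

def self_cond_training_step(net, x_noisy, t, target, self_cond_prob=0.5):
    """With probability self_cond_prob, first predict without gradients,
    then feed that estimate back in as extra conditioning."""
    self_cond = None
    if random.random() < self_cond_prob:
        with torch.no_grad():
            self_cond = net(x_noisy, t).detach()
    pred = net(x_noisy, t, self_cond=self_cond)
    return nn.functional.mse_loss(pred, target)

torch.manual_seed(0)
net = ToyPriorNet(dim=8)
x_noisy = torch.randn(4, 8)          # mock noised image embeddings
target = torch.randn(4, 8)           # mock clean image embeddings
t = torch.randint(0, 1000, (4,))     # mock timesteps

loss = self_cond_training_step(net, x_noisy, t, target, self_cond_prob=1.0)
loss.backward()
```

At sampling time the same idea applies step by step: the estimate produced at one denoising step would be passed as `self_cond` to the next, starting from `None` (zeros) at the first step.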
2026-02-12 11:34:29 +01:00 · 2022-08-12 12:29:25 -07:00 · 2022-08-12 12:20:51 -07:00 · 2022-08-12 11:41:23 -07:00 · 2022-08-12 11:36:08 -07:00 · 2022-08-02 19:21:44 -07:00
31 changed files with 3551 additions and 1038 deletions
--- a/.github/FUNDING.yml
+++ b/.github/FUNDING.yml
@@ -1 +1 @@
-github: [lucidrains]
+github: [nousr, Veldrovive, lucidrains]
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -0,0 +1,33 @@
+name: Continuous integration
+
+on:
+  push:
+    branches:
+    - main
+  pull_request:
+    branches:
+    - main
+
+jobs:
+  tests:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.8]
+
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v2
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install
+      run: |
+        python3 -m venv .env
+        source .env/bin/activate
+        make install
+    - name: Tests
+      run: |
+        source .env/bin/activate
+        make test
+
--- a/.gitignore
+++ b/.gitignore
@@ -136,3 +136,5 @@ dmypy.json

 # Pyre type checker
 .pyre/
+.tracker_data
+*.pth
--- a/Makefile
+++ b/Makefile
@@ -0,0 +1,6 @@
+install:
+	pip install -U pip
+	pip install -e .
+
+test:
+	CUDA_VISIBLE_DEVICES= python train_decoder.py --config_file configs/train_decoder_config.test.json
--- a/README.md
+++ b/README.md
@@ -20,18 +20,20 @@ As of 5/23/22, it is no longer SOTA. SOTA will be <a href="https://github.com/lu

 - Decoder is now verified working for unconditional generation on my experimental setup for Oxford flowers. 2 researchers have also confirmed Decoder is working for them.

-<img src="./samples/oxford.png" width="600px" />
+<img src="./samples/oxford.png" width="450px" />

 *ongoing at 21k steps*

 - <a href="https://twitter.com/Buntworthy/status/1529475416775434240?t=0GEge3Kr9I36cjcUVCQUTg">Justin Pinkney</a> successfully trained the diffusion prior in the repository for his CLIP to Stylegan2 text-to-image application

+- <a href="https://github.com/rom1504">Romain</a> has scaled up training to 800 GPUs with the available scripts without any issues
+
 ## Pre-Trained Models

 - LAION is training prior models. Checkpoints are available on <a href="https://huggingface.co/zenglishuci/conditioned-prior">🤗huggingface</a> and the training statistics are available on <a href="https://wandb.ai/nousr_laion/conditioned-prior/reports/LAION-DALLE2-PyTorch-Prior--VmlldzoyMDI2OTIx">🐝WANDB</a>.
 - Decoder - <a href="https://wandb.ai/veldrovive/dalle2_train_decoder/runs/jkrtg0so?workspace=user-veldrovive">In-progress test run</a> 🚧
 - Decoder - <a href="https://wandb.ai/veldrovive/dalle2_train_decoder/runs/3d5rytsa?workspace=">Another test run with sparse attention</a>
- DALL-E 2 🚧
+- DALL-E 2 🚧 - <a href="https://github.com/LAION-AI/dalle2-laion">DALL-E 2 Laion repository</a>

 ## Appreciation

@@ -42,6 +44,8 @@ This library would not have gotten to this working state without the help of
 - <a href="https://github.com/krish240574">Kumar</a> for working on the initial diffusion training script
 - <a href="https://github.com/rom1504">Romain</a> for the pull request reviews and project management
 - <a href="https://github.com/Ciaohe">He Cao</a> and <a href="https://github.com/xiankgx">xiankgx</a> for the Q&A and for identifying of critical bugs
+- <a href="https://github.com/marunine">Marunine</a> for identifying issues with resizing of the low resolution conditioner, when training the upsampler, in addition to various other bug fixes
+- <a href="https://github.com/malumadev">MalumaDev</a> for proposing the use of pixel shuffle upsampler for fixing checkerboard artifacts
 - <a href="https://github.com/crowsonkb">Katherine</a> for her advice
 - <a href="https://stability.ai/">Stability AI</a> for the generous sponsorship
 - <a href="https://huggingface.co">🤗 Huggingface</a> and in particular <a href="https://github.com/sgugger">Sylvain</a> for the <a href="https://github.com/huggingface/accelerate">Accelerate</a> library
@@ -352,7 +356,8 @@ prior_network = DiffusionPriorNetwork(
 diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
-    timesteps = 100,
+    timesteps = 1000,
+    sample_timesteps = 64,
    cond_drop_prob = 0.2
 ).cuda()

@@ -366,9 +371,11 @@ loss.backward()
 unet1 = Unet(
    dim = 128,
    image_embed_dim = 512,
+    text_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
-    dim_mults=(1, 2, 4, 8)
+    dim_mults=(1, 2, 4, 8),
+    cond_on_text_encodings = True    # set to True for any unets that need to be conditioned on text encodings
 ).cuda()

 unet2 = Unet(
@@ -385,12 +392,11 @@ decoder = Decoder(
    clip = clip,
    timesteps = 100,
    image_cond_drop_prob = 0.1,
-    text_cond_drop_prob = 0.5,
-    condition_on_text_encodings = False  # set this to True if you wish to condition on text during training and sampling
+    text_cond_drop_prob = 0.5
 ).cuda()

 for unet_number in (1, 2):
-    loss = decoder(images, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
+    loss = decoder(images, text = text, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
    loss.backward()

 # do above for many steps
@@ -416,7 +422,7 @@ For the layperson, no worries, training will all be automated into a CLI tool, a

 ## Training on Preprocessed CLIP Embeddings

-It is likely, when scaling up, that you would first preprocess your images and text into corresponding embeddings before training the prior network. You can do so easily by simply passing in `image_embed`, `text_embed`, and optionally `text_encodings` and `text_mask`
+It is likely, when scaling up, that you would first preprocess your images and text into corresponding embeddings before training the prior network. You can do so easily by simply passing in `image_embed`, `text_embed`, and optionally `text_encodings`

 Working example below

@@ -579,7 +585,9 @@ unet1 = Unet(
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
-    dim_mults=(1, 2, 4, 8)
+    dim_mults=(1, 2, 4, 8),
+    text_embed_dim = 512,
+    cond_on_text_encodings = True  # set to True for any unets that need to be conditioned on text encodings (ex. first unet in cascade)
 ).cuda()

 unet2 = Unet(
@@ -594,14 +602,14 @@ decoder = Decoder(
    unet = (unet1, unet2),
    image_sizes = (128, 256),
    clip = clip,
-    timesteps = 100,
+    timesteps = 1000,
+    sample_timesteps = (250, 27),
    image_cond_drop_prob = 0.1,
-    text_cond_drop_prob = 0.5,
-    condition_on_text_encodings = False  # set this to True if you wish to condition on text during training and sampling
+    text_cond_drop_prob = 0.5
 ).cuda()

 for unet_number in (1, 2):
-    loss = decoder(images, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
+    loss = decoder(images, text = text, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
    loss.backward()

 # do above for many steps
@@ -619,8 +627,96 @@ images = dalle2(
 # save your image (in this example, of size 256x256)
 ```

+Alternatively, you can also use <a href="https://github.com/mlfoundations/open_clip">Open Clip</a>
+
+```bash
+$ pip install open-clip-torch
+```
+
+```python
+from dalle2_pytorch import OpenClipAdapter
+
+clip = OpenClipAdapter()
+```
+
 Now you'll just have to worry about training the Prior and the Decoder!

+## Inpainting
+
+Inpainting is also built into the `Decoder`. You simply have to pass in the `inpaint_image` and `inpaint_mask` (boolean tensor where `True` indicates which regions of the inpaint image to keep)
+
+This repository uses the formulation put forth by <a href="https://arxiv.org/abs/2201.09865">Lugmayr et al. in Repaint</a>
+
+```python
+import torch
+from dalle2_pytorch import Unet, Decoder, CLIP
+
+# trained clip from step 1
+
+clip = CLIP(
+    dim_text = 512,
+    dim_image = 512,
+    dim_latent = 512,
+    num_text_tokens = 49408,
+    text_enc_depth = 6,
+    text_seq_len = 256,
+    text_heads = 8,
+    visual_enc_depth = 6,
+    visual_image_size = 256,
+    visual_patch_size = 32,
+    visual_heads = 8
+).cuda()
+
+# a single unet for the decoder (a cascading DDPM would use several)
+
+unet = Unet(
+    dim = 16,
+    image_embed_dim = 512,
+    cond_dim = 128,
+    channels = 3,
+    dim_mults = (1, 1, 1, 1)
+).cuda()
+
+
+# decoder, which contains the unet(s) and clip
+
+decoder = Decoder(
+    clip = clip,
+    unet = (unet,),               # a single unet here; for a cascade, insert the unets in order of lowest to highest resolution
+    image_sizes = (256,),         # one resolution per unet; these must be unique and in ascending order (matches with the unets passed in)
+    timesteps = 1000,
+    image_cond_drop_prob = 0.1,
+    text_cond_drop_prob = 0.5
+).cuda()
+
+# mock images (get a lot of this)
+
+images = torch.randn(4, 3, 256, 256).cuda()
+
+# feed images into decoder, specifying which unet you want to train
+# each unet can be trained separately, which is one of the benefits of the cascading DDPM scheme
+
+loss = decoder(images, unet_number = 1)
+loss.backward()
+
+# do the above for many steps
+
+mock_image_embed = torch.randn(1, 512).cuda()
+
+# then to do inpainting
+
+inpaint_image = torch.randn(1, 3, 256, 256).cuda()      # (batch, channels, height, width)
+inpaint_mask = torch.ones(1, 256, 256).bool().cuda()    # (batch, height, width)
+
+inpainted_images = decoder.sample(
+    image_embed = mock_image_embed,
+    inpaint_image = inpaint_image,    # just pass in the inpaint image
+    inpaint_mask = inpaint_mask       # and the mask
+)
+
+inpainted_images.shape # (1, 3, 256, 256)
+```
+
 ## Experimental

 ### DALL-E2 with Latent Diffusion
@@ -777,25 +873,23 @@ unet1 = Unet(
    text_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
-    dim_mults=(1, 2, 4, 8)
+    dim_mults=(1, 2, 4, 8),
+    cond_on_text_encodings = True,
 ).cuda()

 unet2 = Unet(
    dim = 16,
    image_embed_dim = 512,
-    text_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8, 16),
-    cond_on_text_encodings = True
 ).cuda()

 decoder = Decoder(
    unet = (unet1, unet2),
    image_sizes = (128, 256),
    clip = clip,
-    timesteps = 1000,
-    condition_on_text_encodings = True
+    timesteps = 1000
 ).cuda()

 decoder_trainer = DecoderTrainer(
@@ -820,8 +914,8 @@ for unet_number in (1, 2):
 # after much training
 # you can sample from the exponentially moving averaged unets as so

-mock_image_embed = torch.randn(4, 512).cuda()
-images = decoder_trainer.sample(mock_image_embed, text = text) # (4, 3, 256, 256)
+mock_image_embed = torch.randn(32, 512).cuda()
+images = decoder_trainer.sample(image_embed = mock_image_embed, text = text) # (4, 3, 256, 256)
 ```

 ### Diffusion Prior Training
@@ -984,52 +1078,11 @@ dataset = ImageEmbeddingDataset(
 )
 ```

-### Scripts (wip)
+### Scripts

 #### `train_diffusion_prior.py`

-This script allows training the DiffusionPrior on pre-computed text and image embeddings. The working example below elucidates this process.
-Please note that the script internally passes text_embed and image_embed to the DiffusionPrior, unlike the example below.
-
-#### Usage
-
-```bash
-$ python train_diffusion_prior.py
-```
-
-The most significant parameters for the script are as follows:
-
- `image-embed-url`, default = `"https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/"`
-
- `text-embed-url`, default = `"https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/"`
-
- `image-embed-dim`, default = `768` - 768 corresponds to the ViT iL/14 embedding size,change it to what your chosen ViT generates
-
- `learning-rate`, default = `1.1e-4`
-
- `weight-decay`,  default = `6.02e-2`
-
- `max-grad-norm`, default = `0.5`
-
- `batch-size`, default = `10 ** 4`
-
- `num-epochs`, default = `5`
-
- `clip`, default = `None` # Signals the prior to use pre-computed embeddings
-
-## CLI (wip)
-
-```bash
-$ dream 'sharing a sunset at the summit of mount everest with my dog'
-```
-
-Once built, images will be saved to the same directory the command is invoked
-
-<a href="https://github.com/lucidrains/big-sleep">template</a>
-
-## Training CLI (wip)
-
-<a href="https://github.com/lucidrains/stylegan2-pytorch">template</a>
+For detailed information on training the diffusion prior, please refer to the [dedicated readme](prior.md)

 ## Todo

@@ -1068,11 +1121,11 @@ Once built, images will be saved to the same directory the command is invoked
 - [x] bring in skip-layer excitations (from lightweight gan paper) to see if it helps for either decoder of unet or vqgan-vae training (doesnt work well)
 - [x] test out grid attention in cascading ddpm locally, decide whether to keep or remove https://arxiv.org/abs/2204.01697 (keeping, seems to be fine)
 - [x] allow for unet to be able to condition non-cross attention style as well
- [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo) - consider https://github.com/lucidrains/uformer-pytorch attention-based unet
- [ ] speed up inference, read up on papers (ddim or diffusion-gan, etc)
- [ ] figure out if possible to augment with external memory, as described in https://arxiv.org/abs/2204.11824
+- [x] speed up inference, read up on papers (ddim)
+- [x] add inpainting ability using resampler from repaint paper https://arxiv.org/abs/2201.09865
+- [x] add the final combination of upsample feature maps, used in unet squared, seems to have an effect in local experiments
+- [ ] consider elucidated dalle2 https://arxiv.org/abs/2206.00364
 - [ ] interface out the vqgan-vae so a pretrained one can be pulled off the shelf to validate latent diffusion + DALL-E2
- [ ] build infilling

 ## Citations

@@ -1112,15 +1165,6 @@ Once built, images will be saved to the same directory the command is invoked
 }
 ```

-```bibtex
-@inproceedings{Tu2022MaxViTMV,
-    title   = {MaxViT: Multi-Axis Vision Transformer},
-    author  = {Zhengzhong Tu and Hossein Talebi and Han Zhang and Feng Yang and Peyman Milanfar and Alan Conrad Bovik and Yinxiao Li},
-    year    = {2022},
-    url     = {https://arxiv.org/abs/2204.01697}
-}
-```
-
 ```bibtex
@article{Yu2021VectorquantizedIM,
    title   = {Vector-quantized Image Modeling with Improved VQGAN},
@@ -1189,4 +1233,35 @@ Once built, images will be saved to the same directory the command is invoked
 }
 ```

+```bibtex
+@article{Saharia2021PaletteID,
+    title   = {Palette: Image-to-Image Diffusion Models},
+    author  = {Chitwan Saharia and William Chan and Huiwen Chang and Chris A. Lee and Jonathan Ho and Tim Salimans and David J. Fleet and Mohammad Norouzi},
+    journal = {ArXiv},
+    year    = {2021},
+    volume  = {abs/2111.05826}
+}
+```
+
+```bibtex
+@article{Lugmayr2022RePaintIU,
+    title   = {RePaint: Inpainting using Denoising Diffusion Probabilistic Models},
+    author  = {Andreas Lugmayr and Martin Danelljan and Andr{\'e}s Romero and Fisher Yu and Radu Timofte and Luc Van Gool},
+    journal = {ArXiv},
+    year    = {2022},
+    volume  = {abs/2201.09865}
+}
+```
+
+```bibtex
+@misc{chen2022analog,
+    title   = {Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning},
+    author  = {Ting Chen and Ruixiang Zhang and Geoffrey Hinton},
+    year    = {2022},
+    eprint  = {2208.04202},
+    archivePrefix = {arXiv},
+    primaryClass = {cs.CV}
+}
+```
+
 *Creating noise from data is easy; creating data from noise is generative modeling.* - <a href="https://arxiv.org/abs/2011.13456">Yang Song's paper</a>
--- a/configs/README.md
+++ b/configs/README.md
@@ -30,6 +30,7 @@ Defines the configuration options for the decoder model. The unets defined above
 | `loss_type` | No | `l2` | The loss function. Options are `l1`, `huber`, or `l2`. |
 | `beta_schedule` | No | `cosine` | The noising schedule. Options are `cosine`, `linear`, `quadratic`, `jsd`, or `sigmoid`. |
 | `learned_variance` | No | `True` | Whether to learn the variance. |
+| `clip` | No | `None` | The clip model to use if embeddings are being generated on the fly. Takes keys `make` and `model` with defaults `openai` and `ViT-L/14`. |

 Any parameter from the `Decoder` constructor can also be given here.

@@ -39,7 +40,8 @@ Settings for creation of the dataloaders.
 | Option | Required | Default | Description |
 | ------ | -------- | ------- | ----------- |
 | `webdataset_base_url` | Yes | N/A | The url of a shard in the webdataset with the shard replaced with `{}`[^1]. |
-| `embeddings_url` | No | N/A | The url of the folder containing embeddings shards. Not required if embeddings are in webdataset. |
+| `img_embeddings_url` | No | `None` | The url of the folder containing image embeddings shards. Not required if embeddings are in webdataset or clip is being used. |
+| `text_embeddings_url` | No | `None` | The url of the folder containing text embeddings shards. Not required if embeddings are in webdataset or clip is being used. |
 | `num_workers` | No | `4` | The number of workers used in the dataloader. |
 | `batch_size` | No | `64` | The batch size. |
 | `start_shard` | No | `0` | Defines the start of the shard range the dataset will recall. |
@@ -67,14 +69,12 @@ Settings for controlling the training hyperparameters.
 | `wd` | No | `0.01` | The weight decay. |
 | `max_grad_norm`| No | `0.5` | The grad norm clipping. |
 | `save_every_n_samples` | No | `100000` | Samples will be generated and a checkpoint will be saved every `save_every_n_samples` samples. |
+| `cond_scale` | No | `1.0` | Conditioning scale to use for sampling. Can also be an array of values, one for each unet. |
 | `device` | No | `cuda:0` | The device to train on. |
 | `epoch_samples` | No | `None` | Limits the number of samples iterated through in each epoch. This must be set if resampling. None means no limit. |
 | `validation_samples` | No | `None` | The number of samples to use for validation. None mean the entire validation set. |
 | `use_ema` | No | `True` | Whether to use exponential moving average models for sampling. |
 | `ema_beta` | No | `0.99` | The ema coefficient. |
-| `save_all` | No | `False` | If True, preserves a checkpoint for every epoch. |
-| `save_latest` | No | `True` | If True, overwrites the `latest.pth` every time the model is saved. |
-| `save_best` | No | `True` | If True, overwrites the `best.pth` every time the model has a lower validation loss than all previous models. |
 | `unet_training_mask` | No | `None` | A boolean array of the same length as the number of unets. If false, the unet is frozen. A value of `None` trains all unets. |

 **<ins>Evaluate</ins>:**
@@ -91,21 +91,95 @@ Each metric can be enabled by setting its configuration. The configuration keys

 **<ins>Tracker</ins>:**

-Selects which tracker to use and configures it.
+Selects how the experiment will be tracked.
 | Option | Required | Default | Description |
 | ------ | -------- | ------- | ----------- |
-| `tracker_type` | No | `console` | Which tracker to use. Currently accepts `console` or `wandb`. |
-| `data_path` | No | `./models` | Where the tracker will store local data. |
-| `verbose` | No | `False` | Enables console logging for non-console trackers. |
+| `data_path` | No | `./.tracker-data` | The path to the folder where temporary tracker data will be saved. |
+| `overwrite_data_path` | No | `False` | If true, the data path will be overwritten. Otherwise, you need to delete it yourself. |
+| `log` | Yes | N/A | Logging configuration. |
+| `load` | No | `None` | Checkpoint loading configuration. |
+| `save` | Yes | N/A | Checkpoint/Model saving configuration. |
+Tracking is split up into three sections:
+* Log: Where to save run metadata and image output. Options are `console` or `wandb`.
+* Load: Where to load a checkpoint from. Options are `local`, `url`, or `wandb`.
+* Save: Where to save a checkpoint to. Options are `local`, `huggingface`, or `wandb`.

-Other configuration options are required for the specific trackers. To see which are required, reference the initializer parameters of each [tracker](../dalle2_pytorch/trackers.py).
+**Logging:**

-**<ins>Load</ins>:**
-
-Selects where to load a pretrained model from.
+All loggers have the following keys:
 | Option | Required | Default | Description |
 | ------ | -------- | ------- | ----------- |
-| `source` | No | `None` | Supports `file` or `wandb`. |
-| `resume` | No | `False` | If the tracker support resuming the run, resume it. |
+| `log_type` | Yes | N/A | The type of logger class to use. |
+| `resume` | No | `False` | For loggers that have the option to resume an old run, resume it using manually input parameters. |
+| `auto_resume` | No | `False` | If true, the logger will attempt to resume an old run using parameters from that previous run. |

-Other configuration options are required for loading from a specific source. To see which are required, reference the load methods at the top of the [tracker file](../dalle2_pytorch/trackers.py).
+If using `console` there is no further configuration than setting `log_type` to `console`.
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `log_type` | Yes | N/A | Must be `console`. |
+
+If using `wandb`
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `log_type` | Yes | N/A | Must be `wandb`. |
+| `wandb_entity` | Yes | N/A | The wandb entity to log to. |
+| `wandb_project` | Yes | N/A | The wandb project to save the run to. |
+| `wandb_run_name` | No | `None` | The wandb run name. |
+| `wandb_run_id` | No | `None` | The wandb run id. Used if resuming an old run. |
+
+**Loading:**
+
+All loaders have the following keys:
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `load_from` | Yes | N/A | The type of loader class to use. |
+| `only_auto_resume` | No | `False` | If true, the loader will only load the model if the run is being auto resumed. |
+
+If using `local`
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `load_from` | Yes | N/A | Must be `local`. |
+| `file_path` | Yes | N/A | The path to the checkpoint file. |
+
+If using `url`
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `load_from` | Yes | N/A | Must be `url`. |
+| `url` | Yes | N/A | The url of the checkpoint file. |
+
+If using `wandb`
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `load_from` | Yes | N/A | Must be `wandb`. |
+| `wandb_run_path` | No | `None` | The wandb run path. If `None`, uses the run that is being resumed. |
+| `wandb_file_path` | Yes | N/A | The path to the checkpoint file in the W&B file system. |
+
+**Saving:**
+Unlike `log` and `load`, `save` may be an array of options so that you can save to different locations in a run.
+
+All save locations have these configuration options
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `save_to` | Yes | N/A | Must be `local`, `huggingface`, or `wandb`. |
+| `save_latest_to` | No | `None` | Sets the relative path to save the latest model to. |
+| `save_best_to` | No | `None` | Sets the relative path to save the best model to every time the model has a lower validation loss than all previous models. |
+| `save_meta_to` | No | `None` | The path to save metadata files in. This includes the config files used to start the training. |
+| `save_type` | No | `checkpoint` | The type of save. `checkpoint` saves a checkpoint, `model` saves a model without any fluff (Saves with ema if ema is enabled). |
+
+If using `local`
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `save_to` | Yes | N/A | Must be `local`. |
+
+If using `huggingface`
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `save_to` | Yes | N/A | Must be `huggingface`. |
+| `huggingface_repo` | Yes | N/A | The huggingface repository to save to. |
+| `token_path` | No | `None` | If logging in with the huggingface cli is not possible, point to a token file instead. |
+
+If using `wandb`
+| Option | Required | Default | Description |
+| ------ | -------- | ------- | ----------- |
+| `save_to` | Yes | N/A | Must be `wandb`. |
+| `wandb_run_path` | No | `None` | The wandb run path. If `None`, uses the current run. You will almost always want this to be `None`. |
--- a/configs/train_decoder_config.example.json
+++ b/configs/train_decoder_config.example.json
@@ -20,7 +20,7 @@
    },
    "data": {
        "webdataset_base_url": "pipe:s3cmd get s3://bucket/path/{}.tar -",
-        "embeddings_url": "s3://bucket/embeddings/path/",
+        "img_embeddings_url": "s3://bucket/img_embeddings/path/",
        "num_workers": 4,
        "batch_size": 64,
        "start_shard": 0,
@@ -56,9 +56,6 @@
        "use_ema": true,
        "ema_beta": 0.99,
        "amp": false,
-        "save_all": false,
-        "save_latest": true,
-        "save_best": true,
        "unet_training_mask": [true]
    },
    "evaluate": {
@@ -80,20 +77,33 @@
        }
    },
    "tracker": {
-        "tracker_type": "console",
-        "data_path": "./models",
+        "overwrite_data_path": true,

-        "wandb_entity": "",
-        "wandb_project": "",
+        "log": {
+            "log_type": "wandb",

-        "verbose": false
-    },
-    "load": {
-        "source": null,
+            "wandb_entity": "your_wandb",
+            "wandb_project": "your_project",

-        "run_path": "",
-        "file_path": "",
+            "verbose": true
+        },

-        "resume": false
+        "load": {
+            "load_from": null
+        },
+
+        "save": [{
+            "save_to": "wandb",
+            "save_latest_to": "latest.pth"
+        }, {
+            "save_to": "huggingface",
+            "huggingface_repo": "Veldrovive/test_model",
+
+            "save_latest_to": "path/to/model_dir/latest.pth",
+            "save_best_to": "path/to/model_dir/best.pth",
+            "save_meta_to": "path/to/directory/for/assorted/files",
+
+            "save_type": "model"
+        }]
    }
 }
--- a/configs/train_decoder_config.test.json
+++ b/configs/train_decoder_config.test.json
@@ -0,0 +1,100 @@
+{
+    "decoder": {
+        "unets": [
+            {
+                "dim": 16,
+                "image_embed_dim": 768,
+                "cond_dim": 16,
+                "channels": 3,
+                "dim_mults": [1, 2, 4, 8],
+                "attn_dim_head": 16,
+                "attn_heads": 4,
+		"self_attn": [false, true, true, true]
+            }
+        ],
+        "clip": {
+            "make": "openai",
+            "model": "ViT-L/14"
+        },
+
+	"timesteps": 10,
+        "image_sizes": [64],
+        "channels": 3,
+        "loss_type": "l2",
+        "beta_schedule": ["cosine"],
+        "learned_variance": true
+    },
+    "data": {
+        "webdataset_base_url": "test_data/{}.tar",
+        "num_workers": 4,
+        "batch_size": 4,
+        "start_shard": 0,
+        "end_shard": 9,
+        "shard_width": 1,
+        "index_width": 1,
+        "splits": {
+            "train": 0.75,
+            "val": 0.15,
+            "test": 0.1
+        },
+        "shuffle_train": false,
+        "resample_train": true,
+        "preprocessing": {
+            "RandomResizedCrop": {
+                "size": [224, 224],
+                "scale": [0.75, 1.0],
+                "ratio": [1.0, 1.0]
+            },
+            "ToTensor": true
+        }
+    },
+    "train": {
+        "epochs": 1,
+        "lr": 1e-16,
+        "wd": 0.01,
+        "max_grad_norm": 0.5,
+        "save_every_n_samples": 100,
+        "n_sample_images": 1,
+        "device": "cpu",
+        "epoch_samples": 50,
+        "validation_samples": 5,
+        "use_ema": true,
+        "ema_beta": 0.99,
+        "amp": false,
+        "unet_training_mask": [true]
+    },
+    "evaluate": {
+        "n_evaluation_samples": 2,
+        "FID": {
+            "feature": 64
+        },
+        "IS": {
+            "feature": 64,
+            "splits": 10
+        },
+        "KID": {
+            "feature": 64,
+            "subset_size": 2
+        },
+        "LPIPS": {
+            "net_type": "vgg",
+            "reduction": "mean"
+        }
+    },
+    "tracker": {
+        "overwrite_data_path": true,
+
+	"log": {
+            "log_type": "console"
+	},
+
+        "load": {
+            "load_from": null
+        },
+
+       "save": [{
+            "save_to": "local",
+            "save_latest_to": "latest.pth"
+        }]
+    }
+}
--- a/configs/train_prior_config.example.json
+++ b/configs/train_prior_config.example.json
@@ -1,18 +1,14 @@
 {
    "prior": {
        "clip": {
-            "make": "x-clip",
-            "model": "ViT-L/14",
-            "base_model_kwargs": {
-                "dim_text": 768,
-                "dim_image": 768,
-                "dim_latent": 768
-            }
+            "make": "openai",
+            "model": "ViT-L/14"
        },
        "net": {
            "dim": 768,
            "depth": 12,
            "num_timesteps": 1000,
+            "max_text_len": 77,
            "num_time_embeds": 1,
            "num_image_embeds": 1,
            "num_text_embeds": 1,
@@ -20,8 +16,8 @@
            "heads": 12,
            "ff_mult": 4,
            "norm_out": true,
-            "attn_dropout": 0.0,
-            "ff_dropout": 0.0,
+            "attn_dropout": 0.05,
+            "ff_dropout": 0.05,
            "final_proj": true,
            "normformer": true,
            "rotary_emb": true
@@ -30,6 +26,7 @@
        "image_size": 224,
        "image_channels": 3,
        "timesteps": 1000,
+        "sample_timesteps": 64,
        "cond_drop_prob": 0.1,
        "loss_type": "l2",
        "predict_x_start": true,
@@ -37,34 +34,48 @@
        "condition_on_text_encodings": true
    },
    "data": {
-        "image_url": "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/",
-        "text_url": "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/",
-        "meta_url": "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/laion2B-en-metadata/",
-        "batch_size": 256,
+        "batch_size": 128,
+        "num_data_points": 100000,
+        "eval_every_seconds": 1600,
+        "image_url": "<path to your images>",
+        "meta_url": "<path to your metadata>",
        "splits": {
-            "train": 0.9,
-            "val": 1e-7,
-            "test": 0.0999999
+            "train": 0.8,
+            "val":  0.1,
+            "test": 0.1
        }
    },
    "train": {
-        "epochs": 1,
+        "epochs": 5,
        "lr": 1.1e-4,
        "wd": 6.02e-2,
        "max_grad_norm": 0.5,
        "use_ema": true,
+        "ema_beta": 0.9999,
+        "ema_update_after_step": 50,
+        "warmup_steps": 50,
        "amp": false,
-        "save_every": 10000
-    },
-    "load": {
-        "source": null,
-        "resume": false
+        "save_every_seconds": 3600,
+        "eval_timesteps": [64, 1000],
+        "random_seed": 84513
    },
    "tracker": {
-        "tracker_type": "wandb",
-        "data_path": "./prior_checkpoints",
-        "wandb_entity": "laion",
-        "wandb_project": "diffusion-prior",
-        "verbose": true
+        "data_path": ".prior",
+        "overwrite_data_path": true,
+        "log": {
+            "log_type": "wandb",
+            "wandb_entity": "<your wandb username>",
+            "wandb_project": "prior_debugging",
+            "wandb_resume": false,
+            "verbose": true
+        },
+        "save": [
+            {
+                "save_to": "local",
+                "save_type": "checkpoint",
+                "save_latest_to": ".prior/latest_checkpoint.pth",
+                "save_best_to": ".prior/best_checkpoint.pth"
+            }
+        ]
    }
 }
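A note on the new `data` keys: `num_data_points` and the `splits` fractions presumably combine to give per-split sample counts. The diff does not show how the training script consumes them, so the sketch below is only an illustration under that assumption; `split_sizes` is a hypothetical helper, not a function from the repository.

```python
import math

# Values copied from the "data" section of the config above.
data_cfg = {
    "num_data_points": 100000,
    "splits": {"train": 0.8, "val": 0.1, "test": 0.1},
}

def split_sizes(num_points, splits):
    """Hypothetical helper: turn split fractions into sample counts."""
    total = sum(splits.values())
    # Guard against fractions that do not cover the whole dataset.
    if not math.isclose(total, 1.0):
        raise ValueError(f"split fractions sum to {total}, expected 1.0")
    return {name: round(num_points * frac) for name, frac in splits.items()}

sizes = split_sizes(data_cfg["num_data_points"], data_cfg["splits"])
print(sizes)
```

Compared with the old `0.9 / 1e-7 / 0.0999999` split, the new fractions give the validation set a usable share of the data.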
--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
--- a/dalle2_pytorch/dataloaders/decoder_loader.py
+++ b/dalle2_pytorch/dataloaders/decoder_loader.py
@@ -1,6 +1,7 @@
 import os
 import webdataset as wds
 import torch
+from torch.utils.data import DataLoader
 import numpy as np
 import fsspec
 import shutil
@@ -21,7 +22,7 @@ def get_example_file(fs, path, file_format):
    """
    return fs.glob(os.path.join(path, f"*.{file_format}"))[0]

-def embedding_inserter(samples, embeddings_url, index_width, handler=wds.handlers.reraise_exception):
+def embedding_inserter(samples, embeddings_url, index_width, sample_key='npy', handler=wds.handlers.reraise_exception):
     """Given a datum of {"__key__": str, "__url__": str, ...} adds the corresponding embedding and yields"""
    previous_tar_url = None
    current_embeddings = None
@@ -56,7 +57,7 @@ def embedding_inserter(samples, embeddings_url, index_width, handler=wds.handler
            # We need to check if this sample is nonzero. If it is, this embedding is not valid and we should continue to the next loop
            if torch.count_nonzero(embedding) == 0:
                raise RuntimeError(f"Webdataset had a sample, but no embedding was found. ImgShard: {key[:-index_width]} - Index: {key[-index_width:]}")
-            sample["npy"] = embedding
+            sample[sample_key] = embedding
            yield sample
        except Exception as exn:  # From wds implementation
            if handler(exn):
@@ -84,18 +85,20 @@ def unassociated_shard_skipper(tarfiles, embeddings_url, handler=wds.handlers.re
                continue
            else:
                break
-    
 skip_unassociated_shards = wds.filters.pipelinefilter(unassociated_shard_skipper)

-def verify_keys(samples, handler=wds.handlers.reraise_exception):
+def join_embeddings(samples, handler=wds.handlers.reraise_exception):
    """
-    Requires that both the image and embedding are present in the sample
-    This is important to do as a user may forget they do not have embeddings in their webdataset and neglect to add them using the embedding_folder_url parameter.
+    Merges the img_emb and text_emb keys into a single key "emb": { "text": text_emb, "img": img_emb }.
+    Either or both of text_emb and img_emb may be missing from the sample, so only the ones that exist are added.
    """
    for sample in samples:
        try:
-            assert "jpg" in sample, f"Sample {sample['__key__']} missing image"
-            assert "npy" in sample, f"Sample {sample['__key__']} missing embedding. Did you set embedding_folder_url?"
+            sample['emb'] = {}
+            if 'text_emb' in sample:
+                sample['emb']['text'] = sample['text_emb']
+            if 'img_emb' in sample:
+                sample['emb']['img'] = sample['img_emb']
            yield sample
        except Exception as exn:  # From wds implementation
            if handler(exn):
@@ -103,6 +106,23 @@ def verify_keys(samples, handler=wds.handlers.reraise_exception):
            else:
                break

+def verify_keys(samples, required_keys, handler=wds.handlers.reraise_exception):
+    """
+    Requires that every key in required_keys is present in the sample.
+    This is important because a user may forget that their webdataset lacks embeddings and neglect to add them using the img_embedding_folder_url or text_embedding_folder_url parameters.
+    """
+    for sample in samples:
+        try:
+            for key in required_keys:
+                assert key in sample, f"Sample {sample['__key__']} missing {key}. Has keys {sample.keys()}"
+            yield sample
+        except Exception as exn:  # From wds implementation
+            if handler(exn):
+                continue
+            else:
+                break
+key_verifier = wds.filters.pipelinefilter(verify_keys)
+
 class ImageEmbeddingDataset(wds.DataPipeline, wds.compat.FluidInterface):
    """
     A fluid interface wrapper for DataPipeline that returns image embedding pairs
@@ -112,7 +132,8 @@ class ImageEmbeddingDataset(wds.DataPipeline, wds.compat.FluidInterface):
    def __init__(
            self,
            urls,
-            embedding_folder_url=None,
+            img_embedding_folder_url=None,
+            text_embedding_folder_url=None,
            index_width=None,
            img_preproc=None,
            extra_keys=[],
@@ -136,7 +157,12 @@ class ImageEmbeddingDataset(wds.DataPipeline, wds.compat.FluidInterface):

        """
        super().__init__()
-        keys = ["jpg", "npy"] + extra_keys
+        keys = ["jpg", "emb"] + extra_keys
+        # if img_embedding_folder_url is not None:
+        #     keys.append("img_emb")
+        # if text_embedding_folder_url is not None:
+        #     keys.append("text_emb")
+        # keys.extend(extra_keys)
        self.key_map = {key: i for i, key in enumerate(keys)}
        self.resampling = resample
        self.img_preproc = img_preproc
@@ -145,7 +171,7 @@ class ImageEmbeddingDataset(wds.DataPipeline, wds.compat.FluidInterface):
            # Then this has an s3 link for the webdataset and we need extra packages
            if shutil.which("s3cmd") is None:
                raise RuntimeError("s3cmd is required for s3 webdataset")
-        if "s3:" in embedding_folder_url:
+        if (img_embedding_folder_url is not None and "s3:" in img_embedding_folder_url) or (text_embedding_folder_url is not None and "s3:" in text_embedding_folder_url):
            # Then the embeddings are being loaded from s3 and fsspec requires s3fs
            try:
                import s3fs
@@ -160,17 +186,24 @@ class ImageEmbeddingDataset(wds.DataPipeline, wds.compat.FluidInterface):
            if shuffle_shards:
                self.append(wds.filters.shuffle(1000))
        
-        if embedding_folder_url is not None:
+        if img_embedding_folder_url is not None:
            # There may be webdataset shards that do not have a embedding shard associated with it. If we do not skip these, they would cause issues.
-            self.append(skip_unassociated_shards(embeddings_url=embedding_folder_url, handler=handler))
+            self.append(skip_unassociated_shards(embeddings_url=img_embedding_folder_url, handler=handler))
+        if text_embedding_folder_url is not None:
+            self.append(skip_unassociated_shards(embeddings_url=text_embedding_folder_url, handler=handler))

        self.append(wds.tarfile_to_samples(handler=handler))
        self.append(wds.decode("pilrgb", handler=handler))
-        if embedding_folder_url is not None:
-            # Then we are loading embeddings for a remote source
+        if img_embedding_folder_url is not None:
+            # Then we are loading image embeddings from a remote source
            assert index_width is not None, "Reading embeddings separately requires index width length to be given"
-            self.append(insert_embedding(embeddings_url=embedding_folder_url, index_width=index_width, handler=handler))
-        self.append(verify_keys)
+            self.append(insert_embedding(embeddings_url=img_embedding_folder_url, index_width=index_width, sample_key='img_emb', handler=handler))
+        if text_embedding_folder_url is not None:
+            # Then we are loading text embeddings from a remote source
+            assert index_width is not None, "Reading embeddings separately requires index width length to be given"
+            self.append(insert_embedding(embeddings_url=text_embedding_folder_url, index_width=index_width, sample_key='text_emb', handler=handler))
+        self.append(join_embeddings)
+        self.append(key_verifier(required_keys=keys, handler=handler))
        # Apply preprocessing
        self.append(wds.map(self.preproc))
        self.append(wds.to_tuple(*keys))
@@ -185,7 +218,8 @@ def create_image_embedding_dataloader(
    tar_url,
    num_workers,
    batch_size,
-    embeddings_url=None,
+    img_embeddings_url=None,
+    text_embeddings_url=None,
    index_width=None,
    shuffle_num = None,
    shuffle_shards = True,
@@ -211,7 +245,8 @@ def create_image_embedding_dataloader(
    """
    ds = ImageEmbeddingDataset(
        tar_url,
-        embeddings_url,
+        img_embedding_folder_url=img_embeddings_url,
+        text_embedding_folder_url=text_embeddings_url,
        index_width=index_width,
        shuffle_shards=shuffle_shards,
        resample=resample_shards,
@@ -221,11 +256,11 @@ def create_image_embedding_dataloader(
    )
    if shuffle_num is not None and shuffle_num > 0:
        ds.shuffle(1000)
-    return wds.WebLoader(
+    return DataLoader(
        ds,
        num_workers=num_workers,
        batch_size=batch_size,
        prefetch_factor=2,  # This might be good to have high so the next npy file is prefetched
        pin_memory=True,
        shuffle=False
-    )
+    )
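The two pipeline stages added in this file can be tried out without webdataset. The sketch below mirrors the logic of `join_embeddings` and `verify_keys` from the diff as plain generators. The `handler` argument and the `wds.filters.pipelinefilter` wrapping are left out for brevity, and the sample dicts are made-up placeholders rather than real data.

```python
def join_embeddings(samples):
    """Collect optional 'img_emb' / 'text_emb' entries under one 'emb' key."""
    for sample in samples:
        sample["emb"] = {}
        if "text_emb" in sample:
            sample["emb"]["text"] = sample["text_emb"]
        if "img_emb" in sample:
            sample["emb"]["img"] = sample["img_emb"]
        yield sample

def verify_keys(samples, required_keys):
    """Fail loudly if a sample lacks any of the required keys."""
    for sample in samples:
        for key in required_keys:
            assert key in sample, f"Sample {sample['__key__']} missing {key}. Has keys {sample.keys()}"
        yield sample

# Placeholder samples: the first has only an image embedding, the second has both.
samples = [
    {"__key__": "000000001", "jpg": "<image>", "img_emb": [0.1, 0.2]},
    {"__key__": "000000002", "jpg": "<image>", "img_emb": [0.3, 0.4], "text_emb": [0.5, 0.6]},
]

joined = list(verify_keys(join_embeddings(samples), required_keys=["jpg", "emb"]))
for sample in joined:
    print(sample["__key__"], sorted(sample["emb"].keys()))
```

One thing this makes visible: `join_embeddings` always sets `emb`, even when it ends up empty, so checking for the `emb` key alone does not guarantee that a sample carries any embedding.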
--- a/dalle2_pytorch/dataloaders/prior_loader.py
+++ b/dalle2_pytorch/dataloaders/prior_loader.py
@@ -67,6 +67,15 @@ class PriorEmbeddingDataset(IterableDataset):
    def __str__(self):
        return f"<PriorEmbeddingDataset: start: {self.start}, stop: {self.stop}, len: {self.__len__()}>"

+    def set_start(self, start):
+        """
+        Adjust the starting point within the reader, useful for resuming an epoch
+        """
+        self.start = start
+
+    def get_start(self):
+        return self.start
+
    def get_sample(self):
        """
         pre-process data from either reader into a common format
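The new `set_start` / `get_start` accessors are described as useful for resuming an epoch. The diff does not show how the trainer calls them, so the following is only a sketch of the idea. `RangeDataset` is a hypothetical stand-in that iterates over indices instead of embedding readers, and `samples_seen` is an assumed value that a real run would restore from a checkpoint.

```python
class RangeDataset:
    """Hypothetical stand-in for a dataset that reads indices in [start, stop)."""

    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def set_start(self, start):
        # Adjust the starting point, e.g. to skip samples already consumed.
        self.start = start

    def get_start(self):
        return self.start

    def __iter__(self):
        return iter(range(self.start, self.stop))

dataset = RangeDataset(start=0, stop=10)

# Assumed value; a real run would read this from saved training state.
samples_seen = 4

# Move the start forward so the resumed epoch skips what was already seen.
dataset.set_start(dataset.get_start() + samples_seen)
remaining = list(dataset)
print(remaining)
```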
--- a/dalle2_pytorch/trackers.py
+++ b/dalle2_pytorch/trackers.py
@@ -1,12 +1,18 @@
+import urllib.request
 import os
+import json
 from pathlib import Path
-import importlib
+import shutil
 from itertools import zip_longest
+from typing import Any, Optional, List, Union
+from pydantic import BaseModel

 import torch
-from torch import nn
-
+from dalle2_pytorch.dalle2_pytorch import Decoder, DiffusionPrior
 from dalle2_pytorch.utils import import_or_print_error
+from dalle2_pytorch.trainer import DecoderTrainer, DiffusionPriorTrainer
+from dalle2_pytorch.version import __version__
+from packaging import version

 # constants

@@ -17,136 +23,579 @@ DEFAULT_DATA_PATH = './.tracker-data'
 def exists(val):
    return val is not None

-# load file functions
-
-def load_wandb_file(run_path, file_path, **kwargs):
-    wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb recall function')
-    file_reference = wandb.restore(file_path, run_path=run_path)
-    return file_reference.name
-
-def load_local_file(file_path, **kwargs):
-    return file_path
-
-# base class
-
-class BaseTracker(nn.Module):
-    def __init__(self, data_path = DEFAULT_DATA_PATH):
-        super().__init__()
+class BaseLogger:
+    """
+    An abstract class representing an object that can log data.
+    Parameters:
+        data_path (str): A file path for storing temporary data.
+        verbose (bool): Whether or not to always print logs to the console.
+    """
+    def __init__(self, data_path: str, resume: bool = False, auto_resume: bool = False, verbose: bool = False, **kwargs):
        self.data_path = Path(data_path)
-        self.data_path.mkdir(parents = True, exist_ok = True)
+        self.resume = resume
+        self.auto_resume = auto_resume
+        self.verbose = verbose

-    def init(self, config, **kwargs):
-        raise NotImplementedError
-
-    def log(self, log, **kwargs):
-        raise NotImplementedError
-
-    def log_images(self, images, **kwargs):
-        raise NotImplementedError
-
-    def save_state_dict(self, state_dict, relative_path, **kwargs):
-        raise NotImplementedError
-
-    def recall_state_dict(self, recall_source, *args, **kwargs):
+    def init(self, full_config: BaseModel, extra_config: dict, **kwargs) -> None:
        """
-        Loads a state dict from any source.
-        Since a user may wish to load a model from a different source than their own tracker (i.e. tracking using wandb but recalling from disk),
-            this should not be linked to any individual tracker.
+        Initializes the logger.
+        Errors if the logger is invalid.
+        full_config is the config file dict while extra_config is anything else from the script that is not defined in the config file.
        """
-        # TODO: Pull this into a dict or something similar so that we can add more sources without having a massive switch statement
-        if recall_source == 'wandb':
-            return torch.load(load_wandb_file(*args, **kwargs))
-        elif recall_source == 'local':
-            return torch.load(load_local_file(*args, **kwargs))
-        else:
-            raise ValueError('`recall_source` must be one of `wandb` or `local`')
-
-    def save_file(self, file_path, **kwargs):
        raise NotImplementedError

-    def recall_file(self, recall_source, *args, **kwargs):
-        if recall_source == 'wandb':
-            return load_wandb_file(*args, **kwargs)
-        elif recall_source == 'local':
-            return load_local_file(*args, **kwargs)
-        else:
-            raise ValueError('`recall_source` must be one of `wandb` or `local`')
+    def log(self, log, **kwargs) -> None:
+        raise NotImplementedError

-# Tracker that no-ops all calls except for recall
+    def log_images(self, images, captions=[], image_section="images", **kwargs) -> None:
+        raise NotImplementedError

-class DummyTracker(BaseTracker):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
+    def log_file(self, file_path, **kwargs) -> None:
+        raise NotImplementedError

-    def init(self, config, **kwargs):
-        pass
+    def log_error(self, error_string, **kwargs) -> None:
+        raise NotImplementedError

-    def log(self, log, **kwargs):
-        pass
+    def get_resume_data(self, **kwargs) -> dict:
+        """
+        Returns the logger attributes that, along with { "resume": True }, will be used to resume training.
+        It is assumed that after init is called this data will be complete.
+        If the logger does not have any resume functionality, it should return an empty dict.
+        """
+        raise NotImplementedError

-    def log_images(self, images, **kwargs):
-        pass
+class ConsoleLogger(BaseLogger):
+    def init(self, full_config: BaseModel, extra_config: dict, **kwargs) -> None:
+        print("Logging to console")

-    def save_state_dict(self, state_dict, relative_path, **kwargs):
-        pass
-
-    def save_file(self, file_path, **kwargs):
-        pass
-
-# basic stdout class
-
-class ConsoleTracker(BaseTracker):
-    def init(self, **config):
-        print(config)
-
-    def log(self, log, **kwargs):
+    def log(self, log, **kwargs) -> None:
        print(log)

-    def log_images(self, images, **kwargs): # noop for logging images
-        pass
-    
-    def save_state_dict(self, state_dict, relative_path, **kwargs):
-        torch.save(state_dict, str(self.data_path / relative_path))
-    
-    def save_file(self, file_path, **kwargs):
-        # This is a no-op for local file systems since it is already saved locally
+    def log_images(self, images, captions=[], image_section="images", **kwargs) -> None:
        pass

-# basic wandb class
+    def log_file(self, file_path, **kwargs) -> None:
+        pass

-class WandbTracker(BaseTracker):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb experiment tracker')
+    def log_error(self, error_string, **kwargs) -> None:
+        print(error_string)
+
+    def get_resume_data(self, **kwargs) -> dict:
+        return {}
+
+class WandbLogger(BaseLogger):
+    """
+    Logs to a wandb run.
+    Parameters:
+        data_path (str): A file path for storing temporary data.
+        wandb_entity (str): The wandb entity to log to.
+        wandb_project (str): The wandb project to log to.
+        wandb_run_id (str): The wandb run id to resume.
+        wandb_run_name (str): The wandb run name to use.
+    """
+    def __init__(self,
+        data_path: str,
+        wandb_entity: str,
+        wandb_project: str,
+        wandb_run_id: Optional[str] = None,
+        wandb_run_name: Optional[str] = None,
+        **kwargs
+    ):
+        super().__init__(data_path, **kwargs)
+        self.entity = wandb_entity
+        self.project = wandb_project
+        self.run_id = wandb_run_id
+        self.run_name = wandb_run_name
+
+    def init(self, full_config: BaseModel, extra_config: dict, **kwargs) -> None:
+        assert self.entity is not None, "wandb_entity must be specified for wandb logger"
+        assert self.project is not None, "wandb_project must be specified for wandb logger"
+        self.wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb logger')
        os.environ["WANDB_SILENT"] = "true"
+        # Initializes the wandb run
+        init_object = {
+            "entity": self.entity,
+            "project": self.project,
+            "config": {**full_config.dict(), **extra_config}
+        }
+        if self.run_name is not None:
+            init_object['name'] = self.run_name
+        if self.resume:
+            assert self.run_id is not None, '`wandb_run_id` must be provided if `wandb_resume` is True'
+            if self.run_name is not None:
+                print("You are renaming a run. I hope that is what you intended.")
+            init_object['resume'] = 'must'
+            init_object['id'] = self.run_id

-    def init(self, **config):
-        self.wandb.init(**config)
+        self.wandb.init(**init_object)
+        print(f"Logging to wandb run {self.wandb.run.path}-{self.wandb.run.name}")

-    def log(self, log, verbose=False, **kwargs):
-        if verbose:
+    def log(self, log, **kwargs) -> None:
+        if self.verbose:
            print(log)
        self.wandb.log(log, **kwargs)

-    def log_images(self, images, captions=[], image_section="images", **kwargs):
+    def log_images(self, images, captions=[], image_section="images", **kwargs) -> None:
        """
        Takes a tensor of images and a list of captions and logs them to wandb.
        """
        wandb_images = [self.wandb.Image(image, caption=caption) for image, caption in zip_longest(images, captions)]
-        self.log({ image_section: wandb_images }, **kwargs)
-    
-    def save_state_dict(self, state_dict, relative_path, **kwargs):
-        """
-        Saves a state_dict to disk and uploads it 
-        """
-        full_path = str(self.data_path / relative_path)
-        torch.save(state_dict, full_path)
-        self.wandb.save(full_path, base_path = str(self.data_path))  # Upload and keep relative to data_path
+        self.wandb.log({ image_section: wandb_images }, **kwargs)

-    def save_file(self, file_path, base_path=None, **kwargs):
-        """
-        Uploads a file from disk to wandb
-        """
+    def log_file(self, file_path, base_path: Optional[str] = None, **kwargs) -> None:
        if base_path is None:
-            base_path = self.data_path
+            # Then we take the basepath as the parent of the file_path
+            base_path = Path(file_path).parent
        self.wandb.save(str(file_path), base_path = str(base_path))
+
+    def log_error(self, error_string, step=None, **kwargs) -> None:
+        if self.verbose:
+            print(error_string)
+        self.wandb.log({"error": error_string, **kwargs}, step=step)
+
+    def get_resume_data(self, **kwargs) -> dict:
+        # In order to resume, we need wandb_entity, wandb_project, and wandb_run_id
+        return {
+            "entity": self.entity,
+            "project": self.project,
+            "run_id": self.wandb.run.id
+        }
+
+logger_type_map = {
+    'console': ConsoleLogger,
+    'wandb': WandbLogger,
+}
+def create_logger(logger_type: str, data_path: str, **kwargs) -> BaseLogger:
+    if logger_type == 'custom':
+        raise NotImplementedError('Custom loggers are not supported yet. Please use a different logger type.')
+    try:
+        logger_class = logger_type_map[logger_type]
+    except KeyError:
+        raise ValueError(f'Unknown logger type: {logger_type}. Must be one of {list(logger_type_map.keys())}')
+    return logger_class(data_path, **kwargs)
+
+class BaseLoader:
+    """
+    An abstract class representing an object that can load a model checkpoint.
+    Parameters:
+        data_path (str): A file path for storing temporary data.
+    """
+    def __init__(self, data_path: str, only_auto_resume: bool = False, **kwargs):
+        self.data_path = Path(data_path)
+        self.only_auto_resume = only_auto_resume
+
+    def init(self, logger: BaseLogger, **kwargs) -> None:
+        raise NotImplementedError
+
+    def recall(self) -> dict:
+        raise NotImplementedError
+
+class UrlLoader(BaseLoader):
+    """
+    A loader that downloads the file from a url and loads it
+    Parameters:
+        data_path (str): A file path for storing temporary data.
+        url (str): The url to download the file from.
+    """
+    def __init__(self, data_path: str, url: str, **kwargs):
+        super().__init__(data_path, **kwargs)
+        self.url = url
+
+    def init(self, logger: BaseLogger, **kwargs) -> None:
+        # Makes sure the file exists to be downloaded
+        pass  # TODO: Actually implement that
+
+    def recall(self) -> dict:
+        # Download the file
+        save_path = self.data_path / 'loaded_checkpoint.pth'
+        urllib.request.urlretrieve(self.url, str(save_path))
+        # Load the file
+        return torch.load(str(save_path), map_location='cpu')
+        
+
+class LocalLoader(BaseLoader):
+    """
+    A loader that loads a file from a local path
+    Parameters:
+        data_path (str): A file path for storing temporary data.
+        file_path (str): The path to the file to load.
+    """
+    def __init__(self, data_path: str, file_path: str, **kwargs):
+        super().__init__(data_path, **kwargs)
+        self.file_path = Path(file_path)
+
+    def init(self, logger: BaseLogger, **kwargs) -> None:
+        # Makes sure the file exists to be loaded
+        if not self.file_path.exists() and not self.only_auto_resume:
+            raise FileNotFoundError(f'Model not found at {self.file_path}')
+
+    def recall(self) -> dict:
+        # Load the file
+        return torch.load(str(self.file_path), map_location='cpu')
+
+class WandbLoader(BaseLoader):
+    """
+    A loader that loads a model from an existing wandb run
+    """
+    def __init__(self, data_path: str, wandb_file_path: str, wandb_run_path: Optional[str] = None, **kwargs):
+        super().__init__(data_path, **kwargs)
+        self.run_path = wandb_run_path
+        self.file_path = wandb_file_path
+
+    def init(self, logger: BaseLogger, **kwargs) -> None:
+        self.wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb recall function')
+        # Make sure the file can be downloaded
+        if self.wandb.run is not None and self.run_path is None:
+            self.run_path = self.wandb.run.path
+            assert self.run_path is not None, 'wandb run was not found to load from. If not using the wandb logger must specify the `wandb_run_path`.'
+        assert self.run_path is not None, '`wandb_run_path` must be provided for the wandb loader'
+        assert self.file_path is not None, '`wandb_file_path` must be provided for the wandb loader'
+        
+        os.environ["WANDB_SILENT"] = "true"
+        pass  # TODO: Actually implement that
+
+    def recall(self) -> dict:
+        file_reference = self.wandb.restore(self.file_path, run_path=self.run_path)
+        return torch.load(file_reference.name, map_location='cpu')
+
+loader_type_map = {
+    'url': UrlLoader,
+    'local': LocalLoader,
+    'wandb': WandbLoader,
+}
+def create_loader(loader_type: str, data_path: str, **kwargs) -> BaseLoader:
+    if loader_type == 'custom':
+        raise NotImplementedError('Custom loaders are not supported yet. Please use a different loader type.')
+    try:
+        loader_class = loader_type_map[loader_type]
+    except KeyError:
+        raise ValueError(f'Unknown loader type: {loader_type}. Must be one of {list(loader_type_map.keys())}')
+    return loader_class(data_path, **kwargs)
+
+class BaseSaver:
+    def __init__(self,
+        data_path: str,
+        save_latest_to: Optional[Union[str, bool]] = None,
+        save_best_to: Optional[Union[str, bool]] = None,
+        save_meta_to: Optional[str] = None,
+        save_type: str = 'checkpoint',
+        **kwargs
+    ):
+        self.data_path = Path(data_path)
+        self.save_latest_to = save_latest_to
+        self.saving_latest = save_latest_to is not None and save_latest_to is not False
+        self.save_best_to = save_best_to
+        self.saving_best = save_best_to is not None and save_best_to is not False
+        self.save_meta_to = save_meta_to
+        self.saving_meta = save_meta_to is not None
+        self.save_type = save_type
+        assert save_type in ['checkpoint', 'model'], '`save_type` must be one of `checkpoint` or `model`'
+        assert self.saving_latest or self.saving_best or self.saving_meta, 'At least one saving option must be specified'
+
+    def init(self, logger: BaseLogger, **kwargs) -> None:
+        raise NotImplementedError
+
+    def save_file(self, local_path: Path, save_path: str, is_best=False, is_latest=False, **kwargs) -> None:
+        """
+        Save a general file under save_meta_to
+        """
+        raise NotImplementedError
+
+class LocalSaver(BaseSaver):
+    def __init__(self,
+        data_path: str,
+        **kwargs
+    ):
+        super().__init__(data_path, **kwargs)
+
+    def init(self, logger: BaseLogger, **kwargs) -> None:
+        # Makes sure the directory exists to be saved to
+        print(f"Saving {self.save_type} locally")
+        if not self.data_path.exists():
+            self.data_path.mkdir(parents=True)
+
+    def save_file(self, local_path: str, save_path: str, **kwargs) -> None:
+        # Copy the file to save_path
+        save_path_file_name = Path(save_path).name
+        # Make sure parent directory exists
+        save_path_parent = Path(save_path).parent
+        if not save_path_parent.exists():
+            save_path_parent.mkdir(parents=True)
+        print(f"Saving {save_path_file_name} {self.save_type} to local path {save_path}")
+        shutil.copy(local_path, save_path)
+
+class WandbSaver(BaseSaver):
+    def __init__(self, data_path: str, wandb_run_path: Optional[str] = None, **kwargs):
+        super().__init__(data_path, **kwargs)
+        self.run_path = wandb_run_path
+
+    def init(self, logger: BaseLogger, **kwargs) -> None:
+        self.wandb = import_or_print_error('wandb', '`pip install wandb` to use the wandb logger')
+        os.environ["WANDB_SILENT"] = "true"
+        # Makes sure that the user can upload to this run
+        if self.run_path is not None:
+            entity, project, run_id = self.run_path.split("/")
+            self.run = self.wandb.init(entity=entity, project=project, id=run_id)
+        else:
+            assert self.wandb.run is not None, 'You must be using the wandb logger if you are saving to wandb and have not set `wandb_run_path`'
+            self.run = self.wandb.run
+        # TODO: Now actually check if upload is possible
+        print(f"Saving to wandb run {self.run.path}-{self.run.name}")
+
+    def save_file(self, local_path: Path, save_path: str, **kwargs) -> None:
+        # In order to log something in the correct place in wandb, we need to have the same file structure here
+        save_path_file_name = Path(save_path).name
+        print(f"Saving {save_path_file_name} {self.save_type} to wandb run {self.run.path}-{self.run.name}")
+        save_path = Path(self.data_path) / save_path
+        save_path.parent.mkdir(parents=True, exist_ok=True)
+        shutil.copy(local_path, save_path)
+        self.run.save(str(save_path), base_path = str(self.data_path), policy='now')
+
+class HuggingfaceSaver(BaseSaver):
+    def __init__(self, data_path: str, huggingface_repo: str, token_path: Optional[str] = None, **kwargs):
+        super().__init__(data_path, **kwargs)
+        self.huggingface_repo = huggingface_repo
+        self.token_path = token_path
+
+    def init(self, logger: BaseLogger, **kwargs):
+        # Makes sure this user can upload to the repo
+        self.hub = import_or_print_error('huggingface_hub', '`pip install huggingface_hub` to use the huggingface saver')
+        try:
+            identity = self.hub.whoami()  # Errors if not logged in
+            # Then we are logged in
+        except Exception:
+            # We are not logged in. Use the token_path to set the token.
+            if self.token_path is None or not os.path.exists(self.token_path):
+                raise Exception("Not logged in to huggingface and no valid token_path specified. Please login with `huggingface-cli login` or if that does not work set the token_path.")
+            with open(self.token_path, "r") as f:
+                token = f.read().strip()
+            self.hub.HfApi.set_access_token(token)
+            identity = self.hub.whoami()
+        print(f"Saving to huggingface repo {self.huggingface_repo}")
+
+    def save_file(self, local_path: Path, save_path: str, **kwargs) -> None:
+        # Saving to huggingface is easy, we just need to upload the file with the correct name
+        save_path_file_name = Path(save_path).name
+        print(f"Saving {save_path_file_name} {self.save_type} to huggingface repo {self.huggingface_repo}")
+        self.hub.upload_file(
+            path_or_fileobj=str(local_path),
+            path_in_repo=str(save_path),
+            repo_id=self.huggingface_repo
+        )
+        
+saver_type_map = {
+    'local': LocalSaver,
+    'wandb': WandbSaver,
+    'huggingface': HuggingfaceSaver
+}
+def create_saver(saver_type: str, data_path: str, **kwargs) -> BaseSaver:
+    if saver_type == 'custom':
+        raise NotImplementedError('Custom savers are not supported yet. Please use a different saver type.')
+    try:
+        saver_class = saver_type_map[saver_type]
+    except KeyError:
+        raise ValueError(f'Unknown saver type: {saver_type}. Must be one of {list(saver_type_map.keys())}')
+    return saver_class(data_path, **kwargs)
+
+
+class Tracker:
+    def __init__(self, data_path: Optional[str] = DEFAULT_DATA_PATH, overwrite_data_path: bool = False, dummy_mode: bool = False):
+        self.data_path = Path(data_path)
+        if not dummy_mode:
+            if not overwrite_data_path:
+                assert not self.data_path.exists(), f'Data path {self.data_path} already exists. Set overwrite_data_path to True to overwrite.'
+            # Create the data path whether or not overwriting is allowed
+            if not self.data_path.exists():
+                self.data_path.mkdir(parents=True)
+        self.logger: BaseLogger = None
+        self.loader: Optional[BaseLoader] = None
+        self.savers: List[BaseSaver]= []
+        self.dummy_mode = dummy_mode
+
+    def _load_auto_resume(self) -> bool:
+        # If the file does not exist, we return False. If autoresume is enabled we print a warning so that the user can know that this is the first run.
+        if not self.auto_resume_path.exists():
+            if self.logger.auto_resume:
+                print("Auto_resume is enabled but no auto_resume.json file exists. Assuming this is the first run.")
+            return False
+
+        # Now we know that the autoresume file exists, but if we are not auto resuming we should remove it so that we don't accidentally load it next time
+        if not self.logger.auto_resume:
+            print(f'Removing auto_resume.json because auto_resume is not enabled in the config')
+            self.auto_resume_path.unlink()
+            return False
+
+        # Otherwise we read the json into a dictionary which will override parts of logger.__dict__
+        with open(self.auto_resume_path, 'r') as f:
+            auto_resume_dict = json.load(f)
+        # Check if the logger is of the same type as the autoresume save
+        if auto_resume_dict["logger_type"] != self.logger.__class__.__name__:
+            raise Exception(f'The logger type in the auto_resume file is {auto_resume_dict["logger_type"]} but the current logger is {self.logger.__class__.__name__}. Either use the original logger type, set `auto_resume` to `False`, or delete your existing tracker-data folder.')
+        # Then we are ready to override the logger with the autoresume save
+        self.logger.__dict__["resume"] = True
+        print(f"Updating {self.logger.__dict__} with {auto_resume_dict}")
+        self.logger.__dict__.update(auto_resume_dict)
+        return True
+
+    def _save_auto_resume(self):
+        # Gets the autoresume dict from the logger and adds "logger_type" to it then saves it to the auto_resume file
+        auto_resume_dict = self.logger.get_resume_data()
+        auto_resume_dict['logger_type'] = self.logger.__class__.__name__
+        with open(self.auto_resume_path, 'w') as f:
+            json.dump(auto_resume_dict, f)
+
+    def init(self, full_config: BaseModel, extra_config: dict):
+        assert self.logger is not None, '`logger` must be set before `init` is called'
+        self.auto_resume_path = self.data_path / 'auto_resume.json'
+        # Check for resuming the run
+        self.did_auto_resume = self._load_auto_resume()
+        if self.did_auto_resume:
+            print(f'\n\nWARNING: RUN HAS BEEN AUTO-RESUMED WITH THE LOGGER TYPE {self.logger.__class__.__name__}.\nIf this was not your intention, stop this run and set `auto_resume` to `False` in the config.\n\n')
+            print(f"New logger config: {self.logger.__dict__}")
+
+        self.save_metadata = dict(
+            version = version.parse(__version__)
+        )  # Data that will be saved alongside the checkpoint or model
+        self.blacklisted_checkpoint_metadata_keys = ['scaler', 'optimizer', 'model', 'version', 'step', 'steps']  # These keys would cause us to error if we try to save them as metadata
+
+        if self.dummy_mode:
+            # The only thing we need is a loader
+            if self.loader is not None:
+                self.loader.init(self.logger)
+            return
+        assert len(self.savers) > 0, '`savers` must be set before `init` is called'
+
+        self.logger.init(full_config, extra_config)
+        if self.loader is not None:
+            self.loader.init(self.logger)
+        for saver in self.savers:
+            saver.init(self.logger)
+
+        if self.logger.auto_resume:
+            # Then we need to save the autoresume file. It is assumed after logger.init is called that the logger is ready to be saved.
+            self._save_auto_resume()
+
+    def add_logger(self, logger: BaseLogger):
+        self.logger = logger
+
+    def add_loader(self, loader: BaseLoader):
+        self.loader = loader
+
+    def add_saver(self, saver: BaseSaver):
+        self.savers.append(saver)
+
+    def log(self, *args, **kwargs):
+        if self.dummy_mode:
+            return
+        self.logger.log(*args, **kwargs)
+    
+    def log_images(self, *args, **kwargs):
+        if self.dummy_mode:
+            return
+        self.logger.log_images(*args, **kwargs)
+
+    def log_file(self, *args, **kwargs):
+        if self.dummy_mode:
+            return
+        self.logger.log_file(*args, **kwargs)
+
+    def save_config(self, current_config_path: str, config_name = 'config.json'):
+        if self.dummy_mode:
+            return
+        # Save the config under config_name in the root folder of data_path
+        shutil.copy(current_config_path, self.data_path / config_name)
+        for saver in self.savers:
+            if saver.saving_meta:
+                remote_path = Path(saver.save_meta_to) / config_name
+                saver.save_file(current_config_path, str(remote_path))
+
+    def add_save_metadata(self, state_dict_key: str, metadata: Any):
+        """
+        Adds a new piece of metadata that will be saved along with the model or decoder.
+        """
+        self.save_metadata[state_dict_key] = metadata
+
+    def _save_state_dict(self, trainer: Union[DiffusionPriorTrainer, DecoderTrainer], save_type: str, file_path: str, **kwargs) -> Path:
+        """
+        Gets the state dict to be saved and writes it to file_path.
+        If save_type is 'checkpoint', we save the entire trainer state dict.
+        If save_type is 'model', we save only the model state dict.
+        """
+        assert save_type in ['checkpoint', 'model']
+        if save_type == 'checkpoint':
+            # Create a metadata dict without the blacklisted keys so we do not error when we create the state dict
+            metadata = {k: v for k, v in self.save_metadata.items() if k not in self.blacklisted_checkpoint_metadata_keys}
+            trainer.save(file_path, overwrite=True, **kwargs, **metadata)
+        elif save_type == 'model':
+            if isinstance(trainer, DiffusionPriorTrainer):
+                prior = trainer.ema_diffusion_prior.ema_model if trainer.use_ema else trainer.diffusion_prior
+                prior: DiffusionPrior = trainer.accelerator.unwrap_model(prior)
+                # Remove CLIP if it is part of the model
+                original_clip = prior.clip
+                prior.clip = None
+                model_state_dict = prior.state_dict()
+                prior.clip = original_clip
+            elif isinstance(trainer, DecoderTrainer):
+                decoder: Decoder = trainer.accelerator.unwrap_model(trainer.decoder)
+                # Remove CLIP if it is part of the model
+                original_clip = decoder.clip
+                decoder.clip = None
+                if trainer.use_ema:
+                    trainable_unets = decoder.unets
+                    decoder.unets = trainer.unets  # Swap EMA unets in
+                    model_state_dict = decoder.state_dict()
+                    decoder.unets = trainable_unets  # Swap back
+                else:
+                    model_state_dict = decoder.state_dict()
+                decoder.clip = original_clip
+            else:
+                raise NotImplementedError('Saving this type of model with EMA mode enabled is not yet implemented. Actually, how did you get here?')
+            state_dict = {
+                **self.save_metadata,
+                'model': model_state_dict
+            }
+            torch.save(state_dict, file_path)
+        return Path(file_path)
+
+    def save(self, trainer, is_best: bool, is_latest: bool, **kwargs):
+        if self.dummy_mode:
+            return
+        if not is_best and not is_latest:
+            # Nothing to do
+            return
+        # Save the checkpoint and model to data_path
+        checkpoint_path = self.data_path / 'checkpoint.pth'
+        self._save_state_dict(trainer, 'checkpoint', checkpoint_path, **kwargs)
+        model_path = self.data_path / 'model.pth'
+        self._save_state_dict(trainer, 'model', model_path, **kwargs)
+        print("Saved cached models")
+        # Call the save methods on the savers
+        for saver in self.savers:
+            local_path = checkpoint_path if saver.save_type == 'checkpoint' else model_path
+            if saver.saving_latest and is_latest:
+                latest_checkpoint_path = saver.save_latest_to.format(**kwargs)
+                try:
+                    saver.save_file(local_path, latest_checkpoint_path, is_latest=True, **kwargs)
+                except Exception as e:
+                    self.logger.log_error(f'Error saving checkpoint: {e}', **kwargs)
+                    print(f'Error saving checkpoint: {e}')
+            if saver.saving_best and is_best:
+                best_checkpoint_path = saver.save_best_to.format(**kwargs)
+                try:
+                    saver.save_file(local_path, best_checkpoint_path, is_best=True, **kwargs)
+                except Exception as e:
+                    self.logger.log_error(f'Error saving checkpoint: {e}', **kwargs)
+                    print(f'Error saving checkpoint: {e}')
+    
+    @property
+    def can_recall(self):
+        # Defines whether a recall can be performed.
+        return self.loader is not None and (not self.loader.only_auto_resume or self.did_auto_resume)
+    
+    def recall(self):
+        if self.can_recall:
+            return self.loader.recall()
+        else:
+            raise ValueError('Tried to recall, but no loader was set or auto-resume was not performed.')
+
+
+    
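Reviewer note: the auto-resume handshake in `Tracker._load_auto_resume` / `Tracker._save_auto_resume` above can be illustrated in isolation. The following is only a minimal sketch under stated assumptions: `ToyLogger`, `run_id`, and the two free functions are hypothetical stand-ins and not part of the repository; only the JSON round trip and the `logger_type` check mirror the diff.

```python
import json
import tempfile
from pathlib import Path


class ToyLogger:
    """Hypothetical stand-in for a BaseLogger subclass (not from the repo)."""

    def __init__(self, auto_resume: bool = True):
        self.auto_resume = auto_resume
        self.resume = False
        self.run_id = None  # hypothetical field a later run would need

    def get_resume_data(self) -> dict:
        # Whatever a later run needs in order to reattach to this one
        return {"run_id": self.run_id}


def save_auto_resume(logger, path: Path) -> None:
    # Tag the resume data with the logger class name before writing it out
    data = logger.get_resume_data()
    data["logger_type"] = type(logger).__name__
    path.write_text(json.dumps(data))


def load_auto_resume(logger, path: Path) -> bool:
    if not path.exists():
        return False  # first run, nothing to resume
    if not logger.auto_resume:
        path.unlink()  # stale file: remove it so it is not picked up later
        return False
    data = json.loads(path.read_text())
    if data["logger_type"] != type(logger).__name__:
        raise ValueError("auto_resume.json was written by a different logger type")
    logger.resume = True
    logger.__dict__.update(data)  # saved values override the fresh logger's attributes
    return True


with tempfile.TemporaryDirectory() as tmp:
    resume_path = Path(tmp) / "auto_resume.json"

    # First run: no file yet, so nothing is resumed; the run then records itself
    first = ToyLogger()
    first.run_id = "abc123"
    first_resumed = load_auto_resume(first, resume_path)
    save_auto_resume(first, resume_path)

    # Second run (e.g. after a crash): picks up the recorded run id
    second = ToyLogger()
    second_resumed = load_auto_resume(second, resume_path)
    second_run_id = second.run_id
    second_resume_flag = second.resume
```

Storing the logger class name alongside the resume data is what lets a restart fail loudly when the config has switched logger types in the meantime.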
--- a/dalle2_pytorch/train_configs.py
+++ b/dalle2_pytorch/train_configs.py
@@ -1,7 +1,7 @@
 import json
 from torchvision import transforms as T
 from pydantic import BaseModel, validator, root_validator
-from typing import List, Iterable, Optional, Union, Tuple, Dict, Any
+from typing import List, Optional, Union, Tuple, Dict, Any, TypeVar

 from x_clip import CLIP as XCLIP
 from coca_pytorch import CoCa
@@ -13,8 +13,9 @@ from dalle2_pytorch.dalle2_pytorch import (
    Decoder,
    DiffusionPrior,
    DiffusionPriorNetwork,
-    XClipAdapter,
+    XClipAdapter
 )
+from dalle2_pytorch.trackers import Tracker, create_loader, create_logger, create_saver

 # helper functions

@@ -24,11 +25,9 @@ def exists(val):
 def default(val, d):
    return val if exists(val) else d

-def ListOrTuple(inner_type):
-    return Union[List[inner_type], Tuple[inner_type]]
-
-def SingularOrIterable(inner_type):
-    return Union[inner_type, ListOrTuple(inner_type)]
+InnerType = TypeVar('InnerType')
+ListOrTuple = Union[List[InnerType], Tuple[InnerType]]
+SingularOrIterable = Union[InnerType, ListOrTuple[InnerType]]

 # general pydantic classes

@@ -44,13 +43,69 @@ class TrainSplitConfig(BaseModel):
            raise ValueError(f'{fields.keys()} must sum to 1.0. Found: {actual_sum}')
        return fields

+class TrackerLogConfig(BaseModel):
+    log_type: str = 'console'
+    resume: bool = False  # For logs that are saved to unique locations, resume a previous run
+    auto_resume: bool = False  # If the process crashes and restarts, resume from the run that crashed
+    verbose: bool = False
+
+    class Config:
+        # Each individual log type has its own arguments that will be passed through the config
+        extra = "allow"
+
+    def create(self, data_path: str):
+        kwargs = self.dict()
+        return create_logger(self.log_type, data_path, **kwargs)
+
+class TrackerLoadConfig(BaseModel):
+    load_from: Optional[str] = None
+    only_auto_resume: bool = False  # Only attempt to load if the logger is auto-resuming
+
+    class Config:
+        extra = "allow"
+
+    def create(self, data_path: str):
+        kwargs = self.dict()
+        if self.load_from is None:
+            return None
+        return create_loader(self.load_from, data_path, **kwargs)
+
+class TrackerSaveConfig(BaseModel):
+    save_to: str = 'local'
+    save_all: bool = False
+    save_latest: bool = True
+    save_best: bool = True
+
+    class Config:
+        extra = "allow"
+
+    def create(self, data_path: str):
+        kwargs = self.dict()
+        return create_saver(self.save_to, data_path, **kwargs)
+
 class TrackerConfig(BaseModel):
-    tracker_type: str = 'console'           # Decoder currently supports console and wandb
-    data_path: str = './models'             # The path where files will be saved locally
-    init_config: Dict[str, Any] = None
-    wandb_entity: str = ''                  # Only needs to be set if tracker_type is wandb
-    wandb_project: str = ''
-    verbose: bool = False                   # Whether to print console logging for non-console trackers
+    data_path: str = '.tracker_data'
+    overwrite_data_path: bool = False
+    log: TrackerLogConfig
+    load: Optional[TrackerLoadConfig]
+    save: Union[List[TrackerSaveConfig], TrackerSaveConfig]
+
+    def create(self, full_config: BaseModel, extra_config: dict, dummy_mode: bool = False) -> Tracker:
+        tracker = Tracker(self.data_path, dummy_mode=dummy_mode, overwrite_data_path=self.overwrite_data_path)
+        # Add the logger
+        tracker.add_logger(self.log.create(self.data_path))
+        # Add the loader
+        if self.load is not None:
+            tracker.add_loader(self.load.create(self.data_path))
+        # Add the saver or savers
+        if isinstance(self.save, list):
+            for save_config in self.save:
+                tracker.add_saver(save_config.create(self.data_path))
+        else:
+            tracker.add_saver(self.save.create(self.data_path))
+        # Initialize all the components and verify that all data is valid
+        tracker.init(full_config, extra_config)
+        return tracker

 # diffusion prior pydantic classes

@@ -72,6 +127,7 @@ class AdapterConfig(BaseModel):
 class DiffusionPriorNetworkConfig(BaseModel):
    dim: int
    depth: int
+    max_text_len: int = None
    num_timesteps: int = None
    num_time_embeds: int = 1
    num_image_embeds: int = 1
@@ -79,6 +135,7 @@ class DiffusionPriorNetworkConfig(BaseModel):
    dim_head: int = 64
    heads: int = 8
    ff_mult: int = 4
+    norm_in: bool = False
    norm_out: bool = True
    attn_dropout: float = 0.
    ff_dropout: float = 0.
@@ -86,6 +143,9 @@ class DiffusionPriorNetworkConfig(BaseModel):
    normformer: bool = False
    rotary_emb: bool = True

+    class Config:
+        extra = "allow"
+
    def create(self):
        kwargs = self.dict()
        return DiffusionPriorNetwork(**kwargs)
@@ -97,6 +157,7 @@ class DiffusionPriorConfig(BaseModel):
    image_size: int
    image_channels: int = 3
    timesteps: int = 1000
+    sample_timesteps: Optional[int] = None
    cond_drop_prob: float = 0.
    loss_type: str = 'l2'
    predict_x_start: bool = True
@@ -127,23 +188,26 @@ class DiffusionPriorTrainConfig(BaseModel):
    use_ema: bool = True
    ema_beta: float = 0.99
    amp: bool = False
-    save_every: int = 10000 # what steps to save on
+    warmup_steps: int = None             # number of warmup steps
+    save_every_seconds: int = 3600       # how often to save
+    eval_timesteps: List[int] = [64]     # which sampling timesteps to evaluate with
+    best_validation_loss: float = 1e9    # the current best validation loss observed
+    current_epoch: int = 0               # the current epoch
+    num_samples_seen: int = 0            # the current number of samples seen
+    random_seed: int = 0                 # manual seed for torch

 class DiffusionPriorDataConfig(BaseModel):
-    image_url: str     # path to embeddings folder
-    meta_url: str      # path to metadata (captions) for images
-    splits: TrainSplitConfig
-    batch_size: int = 64
-
-class DiffusionPriorLoadConfig(BaseModel):
-    source: str = None
-    resume: bool = False
+    image_url: str                   # path to embeddings folder
+    meta_url: str                    # path to metadata (captions) for images
+    splits: TrainSplitConfig         # define train, validation, test splits for your dataset
+    batch_size: int                  # per-gpu batch size used to train the model
+    num_data_points: int = int(25e7) # total number of datapoints to train on
+    eval_every_seconds: int = 3600   # validation statistics will be performed this often

 class TrainDiffusionPriorConfig(BaseModel):
    prior: DiffusionPriorConfig
    data: DiffusionPriorDataConfig
    train: DiffusionPriorTrainConfig
-    load: DiffusionPriorLoadConfig
    tracker: TrackerConfig

    @classmethod
@@ -156,33 +220,46 @@ class TrainDiffusionPriorConfig(BaseModel):

 class UnetConfig(BaseModel):
    dim: int
-    dim_mults: ListOrTuple(int)
+    dim_mults: ListOrTuple[int]
    image_embed_dim: int = None
+    text_embed_dim: int = None
+    cond_on_text_encodings: bool = None
    cond_dim: int = None
    channels: int = 3
+    self_attn: ListOrTuple[int]
    attn_dim_head: int = 32
    attn_heads: int = 16
+    init_cross_embed: bool = True

    class Config:
        extra = "allow"

 class DecoderConfig(BaseModel):
-    unets: ListOrTuple(UnetConfig)
+    unets: ListOrTuple[UnetConfig]
    image_size: int = None
-    image_sizes: ListOrTuple(int) = None
+    image_sizes: ListOrTuple[int] = None
+    clip: Optional[AdapterConfig]   # The clip model to use if embeddings are not provided
    channels: int = 3
    timesteps: int = 1000
+    sample_timesteps: Optional[SingularOrIterable[int]] = None
    loss_type: str = 'l2'
-    beta_schedule: ListOrTuple(str) = 'cosine'
-    learned_variance: bool = True
+    beta_schedule: ListOrTuple[str] = None  # None means all cosine
+    learned_variance: SingularOrIterable[bool] = True
    image_cond_drop_prob: float = 0.1
    text_cond_drop_prob: float = 0.5

    def create(self):
        decoder_kwargs = self.dict()
+
        unet_configs = decoder_kwargs.pop('unets')
        unets = [Unet(**config) for config in unet_configs]
-        return Decoder(unets, **decoder_kwargs)
+
+        has_clip = exists(decoder_kwargs.pop('clip'))
+        clip = None
+        if has_clip:
+            clip = self.clip.create()
+
+        return Decoder(unets, clip=clip, **decoder_kwargs)

    @validator('image_sizes')
    def check_image_sizes(cls, image_sizes, values):
@@ -194,8 +271,9 @@ class DecoderConfig(BaseModel):
        extra = "allow"

 class DecoderDataConfig(BaseModel):
-    webdataset_base_url: str     # path to a webdataset with jpg images
-    embeddings_url: str          # path to .npy files with embeddings
+    webdataset_base_url: str               # path to a webdataset with jpg images
+    img_embeddings_url: Optional[str]      # path to .npy files with image embeddings
+    text_embeddings_url: Optional[str]     # path to .npy files with text embeddings
    num_workers: int = 4
    batch_size: int = 64
    start_shard: int = 0
@@ -225,21 +303,22 @@ class DecoderDataConfig(BaseModel):

 class DecoderTrainConfig(BaseModel):
    epochs: int = 20
-    lr: SingularOrIterable(float) = 1e-4
-    wd: SingularOrIterable(float) = 0.01
-    max_grad_norm: SingularOrIterable(float) = 0.5
+    lr: SingularOrIterable[float] = 1e-4
+    wd: SingularOrIterable[float] = 0.01
+    warmup_steps: Optional[SingularOrIterable[int]] = None
+    find_unused_parameters: bool = True
+    max_grad_norm: SingularOrIterable[float] = 0.5
    save_every_n_samples: int = 100000
    n_sample_images: int = 6                       # The number of example images to produce when sampling the train and test dataset
+    cond_scale: Union[float, List[float]] = 1.0
    device: str = 'cuda:0'
    epoch_samples: int = None                      # Limits the number of samples per epoch. None means no limit. Required if resample_train is true as otherwise the number of samples per epoch is infinite.
    validation_samples: int = None                 # Same as above but for validation.
+    save_immediately: bool = False
    use_ema: bool = True
    ema_beta: float = 0.999
    amp: bool = False
-    save_all: bool = False                         # Whether to preserve all checkpoints
-    save_latest: bool = True                       # Whether to always save the latest checkpoint
-    save_best: bool = True                         # Whether to save the best checkpoint
-    unet_training_mask: ListOrTuple(bool) = None   # If None, use all unets
+    unet_training_mask: ListOrTuple[bool] = None   # If None, use all unets

 class DecoderEvaluateConfig(BaseModel):
    n_evaluation_samples: int = 1000
@@ -248,19 +327,12 @@ class DecoderEvaluateConfig(BaseModel):
    KID: Dict[str, Any] = None
    LPIPS: Dict[str, Any] = None

-class DecoderLoadConfig(BaseModel):
-    source: str = None                      # Supports file and wandb
-    run_path: str = ''                      # Used only if source is wandb
-    file_path: str = ''                     # The local filepath if source is file. If source is wandb, the relative path to the model file in wandb.
-    resume: bool = False                    # If using wandb, whether to resume the run
-
 class TrainDecoderConfig(BaseModel):
    decoder: DecoderConfig
    data: DecoderDataConfig
    train: DecoderTrainConfig
    evaluate: DecoderEvaluateConfig
    tracker: TrackerConfig
-    load: DecoderLoadConfig
    seed: int = 0

    @classmethod
@@ -268,3 +340,32 @@ class TrainDecoderConfig(BaseModel):
        with open(json_path) as f:
            config = json.load(f)
        return cls(**config)
+    
+    @root_validator
+    def check_has_embeddings(cls, values):
+        # Makes sure that enough information is provided to get the embeddings specified for training
+        data_config, decoder_config = values.get('data'), values.get('decoder')
+
+        if not exists(data_config) or not exists(decoder_config):
+            # Then something else errored and we should just pass through
+            return values
+
+        using_text_embeddings = any([unet.cond_on_text_encodings for unet in decoder_config.unets])
+        using_clip = exists(decoder_config.clip)
+        img_emb_url = data_config.img_embeddings_url
+        text_emb_url = data_config.text_embeddings_url
+
+        if using_text_embeddings:
+            # Then we need some way to get the embeddings
+            assert using_clip or exists(text_emb_url), 'If text conditioning, either clip or text_embeddings_url must be provided'
+
+        if using_clip:
+            if using_text_embeddings:
+                assert not exists(text_emb_url) or not exists(img_emb_url), 'Loaded clip, but also provided text_embeddings_url and img_embeddings_url. This is redundant. Remove the clip model or the text embeddings'
+            else:
+                assert not exists(img_emb_url), 'Loaded clip, but also provided img_embeddings_url. This is redundant. Remove the clip model or the embeddings'
+
+        if text_emb_url:
+            assert using_text_embeddings, "Text embeddings are being loaded, but text embeddings are not being conditioned on. This will slow down the dataloader for no reason."
+
+        return values
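Reviewer note: the switch from the `ListOrTuple(...)` helper functions to `TypeVar`-based aliases in `train_configs.py` above can be checked with the standard `typing` module alone. A minimal sketch, assuming nothing beyond the three alias definitions from the diff; `dim_mults_type` and `lr_type` are illustrative names only:

```python
from typing import List, Tuple, TypeVar, Union

# Same alias definitions as in the diff
InnerType = TypeVar('InnerType')
ListOrTuple = Union[List[InnerType], Tuple[InnerType]]
SingularOrIterable = Union[InnerType, ListOrTuple[InnerType]]

# Subscripting a generic alias substitutes the TypeVar
dim_mults_type = ListOrTuple[int]       # illustrative name
lr_type = SingularOrIterable[float]     # illustrative name

# Nested Unions are flattened by the typing module
flattened = Union[float, List[float], Tuple[float]]
```

One detail worth keeping in mind when reading these annotations: in `typing`, `Tuple[int]` denotes a tuple of exactly one `int`, while `Tuple[int, ...]` is the variable-length form.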
--- a/dalle2_pytorch/trainer.py
+++ b/dalle2_pytorch/trainer.py
@@ -3,10 +3,13 @@ import copy
 from pathlib import Path
 from math import ceil
 from functools import partial, wraps
+from contextlib import nullcontext
 from collections.abc import Iterable

 import torch
+import torch.nn.functional as F
 from torch import nn
+from torch.optim.lr_scheduler import LambdaLR
 from torch.cuda.amp import autocast, GradScaler

 from dalle2_pytorch.dalle2_pytorch import Decoder, DiffusionPrior
@@ -14,9 +17,11 @@ from dalle2_pytorch.optimizer import get_optimizer
 from dalle2_pytorch.version import __version__
 from packaging import version

+import pytorch_warmup as warmup
+
 from ema_pytorch import EMA

-from accelerate import Accelerator
+from accelerate import Accelerator, DistributedType

 import numpy as np

@@ -71,6 +76,7 @@ def cast_torch_tensor(fn):
    def inner(model, *args, **kwargs):
        device = kwargs.pop('_device', next(model.parameters()).device)
        cast_device = kwargs.pop('_cast_device', True)
+        cast_deepspeed_precision = kwargs.pop('_cast_deepspeed_precision', True)

        kwargs_keys = kwargs.keys()
        all_args = (*args, *kwargs.values())
@@ -80,6 +86,21 @@ def cast_torch_tensor(fn):
        if cast_device:
            all_args = tuple(map(lambda t: t.to(device) if exists(t) and isinstance(t, torch.Tensor) else t, all_args))

+        if cast_deepspeed_precision:
+            try:
+                accelerator = model.accelerator
+                if accelerator is not None and accelerator.distributed_type == DistributedType.DEEPSPEED:
+                    cast_type_map = {
+                        "fp16": torch.half,
+                        "bf16": torch.bfloat16,
+                        "no": torch.float
+                    }
+                    precision_type = cast_type_map[accelerator.mixed_precision]
+                    all_args = tuple(map(lambda t: t.to(precision_type) if exists(t) and isinstance(t, torch.Tensor) else t, all_args))
+            except AttributeError:
+                # Then this model doesn't have an accelerator
+                pass
+
        args, kwargs_values = all_args[:split_kwargs_index], all_args[split_kwargs_index:]
        kwargs = dict(tuple(zip(kwargs_keys, kwargs_values)))

@@ -153,37 +174,57 @@ class DiffusionPriorTrainer(nn.Module):
    def __init__(
        self,
        diffusion_prior,
+        accelerator = None,
        use_ema = True,
        lr = 3e-4,
        wd = 1e-2,
        eps = 1e-6,
        max_grad_norm = None,
-        amp = False,
        group_wd_params = True,
-        device = None,
-        accelerator = None,
+        warmup_steps = 1,
        **kwargs
    ):
        super().__init__()
        assert isinstance(diffusion_prior, DiffusionPrior)
-        assert not exists(accelerator) or isinstance(accelerator, Accelerator)
-        assert exists(accelerator) or exists(device), "You must supply some method of obtaining a device."
+
        ema_kwargs, kwargs = groupby_prefix_and_trim('ema_', kwargs)
+        accelerator_kwargs, kwargs = groupby_prefix_and_trim('accelerator_', kwargs)
+
+        if not exists(accelerator):
+            accelerator = Accelerator(**accelerator_kwargs)

        # assign some helpful member vars
+
        self.accelerator = accelerator
-        self.device = accelerator.device if exists(accelerator) else device
        self.text_conditioned = diffusion_prior.condition_on_text_encodings

+        # setting the device
+
+        self.device = accelerator.device
+        diffusion_prior.to(self.device)
+
        # save model

        self.diffusion_prior = diffusion_prior

-        # optimizer and mixed precision stuff
+        # mixed precision checks

-        self.amp = amp
+        if (
+            exists(self.accelerator) 
+            and self.accelerator.distributed_type == DistributedType.DEEPSPEED 
+            and self.diffusion_prior.clip is not None
+            ):
+            # Then we need to make sure clip is using the correct precision or else deepspeed will error
+            cast_type_map = {
+                "fp16": torch.half,
+                "bf16": torch.bfloat16,
+                "no": torch.float
+            }
+            precision_type = cast_type_map[accelerator.mixed_precision]
+            assert precision_type == torch.float, "DeepSpeed currently only supports float32 precision when using on the fly embedding generation from clip"
+            self.diffusion_prior.clip.to(precision_type)

-        self.scaler = GradScaler(enabled = amp)
+        # optimizer stuff

        self.optim_kwargs = dict(lr=lr, wd=wd, eps=eps, group_wd_params=group_wd_params)

@@ -192,17 +233,21 @@ class DiffusionPriorTrainer(nn.Module):
            **self.optim_kwargs,
            **kwargs
        )
+        
+        self.scheduler = LambdaLR(self.optimizer, lr_lambda = lambda _: 1.0)
+        
+        self.warmup_scheduler = warmup.LinearWarmup(self.optimizer, warmup_period = warmup_steps) if exists(warmup_steps) else None

        # distribute the model if using HFA
-        if exists(self.accelerator):
-            self.diffusion_prior, self.optimizer = self.accelerator.prepare(self.diffusion_prior, self.optimizer)
+
+        self.diffusion_prior, self.optimizer, self.scheduler = self.accelerator.prepare(self.diffusion_prior, self.optimizer, self.scheduler)

        # exponential moving average stuff

        self.use_ema = use_ema

        if self.use_ema:
-            self.ema_diffusion_prior = EMA(self.unwrap_model(self.diffusion_prior), **ema_kwargs)
+            self.ema_diffusion_prior = EMA(self.accelerator.unwrap_model(self.diffusion_prior), **ema_kwargs)

        # gradient clipping if needed

@@ -210,66 +255,26 @@ class DiffusionPriorTrainer(nn.Module):

        # track steps internally

-        self.register_buffer('step', torch.tensor([0]))
-
-    # accelerator wrappers
-
-    def print(self, msg):
-        if exists(self.accelerator):
-            self.accelerator.print(msg)
-        else:
-            print(msg)
-
-    def unwrap_model(self, model):
-        if exists(self.accelerator):
-            return self.accelerator.unwrap_model(model)
-        else:
-            return model
-
-    def wait_for_everyone(self):
-        if exists(self.accelerator):
-            self.accelerator.wait_for_everyone()
-
-    def is_main_process(self):
-        if exists(self.accelerator):
-            return self.accelerator.is_main_process
-        else:
-            return True
-
-    def clip_grad_norm_(self, *args):
-        if exists(self.accelerator):
-            return self.accelerator.clip_grad_norm_(*args)
-        else:
-            return torch.nn.utils.clip_grad_norm_(*args)
-
-    def backprop(self, x):
-        if exists(self.accelerator):
-            self.accelerator.backward(x)
-        else:
-            try:
-                x.backward()
-            except Exception as e:
-                self.print(f"Caught error in backprop call: {e}")
+        self.register_buffer('step', torch.tensor([0], device = self.device))

    # utility

    def save(self, path, overwrite = True, **kwargs):
-        # ensure we sync gradients before continuing
-        self.wait_for_everyone()

        # only save on the main process
-        if self.is_main_process():
-            self.print(f"Saving checkpoint at step: {self.step.item()}")
+        if self.accelerator.is_main_process:
+            print(f"Saving checkpoint at step: {self.step.item()}")
            path = Path(path)
            assert not (path.exists() and not overwrite)
            path.parent.mkdir(parents = True, exist_ok = True)

+            # FIXME: LambdaLR can't be saved due to pickling issues
            save_obj = dict(
-                scaler = self.scaler.state_dict(),
                optimizer = self.optimizer.state_dict(),
-                model = self.unwrap_model(self.diffusion_prior).state_dict(), # unwrap the model from distribution if applicable
+                warmup_scheduler = self.warmup_scheduler,
+                model = self.accelerator.unwrap_model(self.diffusion_prior).state_dict(),
                version = version.parse(__version__),
-                step = self.step.item(),
+                step = self.step,
                **kwargs
            )

@@ -282,14 +287,14 @@ class DiffusionPriorTrainer(nn.Module):

            torch.save(save_obj, str(path))

-    def load(self, path, overwrite_lr = True, strict = True):
+    def load(self, path_or_state, overwrite_lr = True, strict = True):
        """
        Load a checkpoint of a diffusion prior trainer.

        Will load the entire trainer, including the optimizer and EMA.

        Params:
-            - path (str): a path to the DiffusionPriorTrainer checkpoint file
+            - path_or_state (str | dict): a path to the DiffusionPriorTrainer checkpoint file, or an already-loaded checkpoint dict
            - overwrite_lr (bool): wether or not to overwrite the stored LR with the LR specified in the new trainer
            - strict (bool): kwarg for `torch.nn.Module.load_state_dict`, will force an exact checkpoint match

@@ -298,56 +303,56 @@ class DiffusionPriorTrainer(nn.Module):
        """

        # all processes need to load checkpoint. no restriction here
-        path = Path(path)
-        assert path.exists()
+        if isinstance(path_or_state, str):
+            path = Path(path_or_state)
+            assert path.exists()
+            loaded_obj = torch.load(str(path), map_location=self.device)

-        loaded_obj = torch.load(str(path), map_location=self.device)
+        elif isinstance(path_or_state, dict):
+            loaded_obj = path_or_state

        if version.parse(__version__) != loaded_obj['version']:
            print(f'loading saved diffusion prior at version {loaded_obj["version"]} but current package version is at {__version__}')

        # unwrap the model when loading from checkpoint
-        self.unwrap_model(self.diffusion_prior).load_state_dict(loaded_obj['model'], strict = strict)
-        self.step.copy_(torch.ones_like(self.step) * loaded_obj['step'])
-
-        self.scaler.load_state_dict(loaded_obj['scaler'])
+        self.accelerator.unwrap_model(self.diffusion_prior).load_state_dict(loaded_obj['model'], strict = strict)
+        self.step.copy_(torch.ones_like(self.step, device=self.device) * loaded_obj['step'].to(self.device))
        self.optimizer.load_state_dict(loaded_obj['optimizer'])

+        # set warmup step
+        if exists(self.warmup_scheduler):
+            self.warmup_scheduler.last_step = self.step.item()
+
+        # ensure new lr is used if different from old one
        if overwrite_lr:
            new_lr = self.optim_kwargs["lr"]

-            self.print(f"Overriding LR to be {new_lr}")
-
            for group in self.optimizer.param_groups:
-                group["lr"] = new_lr
+                group["lr"] = new_lr if group["lr"] > 0.0 else 0.0

        if self.use_ema:
            assert 'ema' in loaded_obj
            self.ema_diffusion_prior.load_state_dict(loaded_obj['ema'], strict = strict)
-            # below not be necessary, but I had a suspicion that this wasn't being loaded correctly
+            # below might not be necessary, but I had a suspicion that this wasn't being loaded correctly
            self.ema_diffusion_prior.ema_model.load_state_dict(loaded_obj["ema_model"])

-        # sync and inform
-        self.wait_for_everyone()
-        self.print(f"Loaded model")
-
        return loaded_obj

    # model functionality

    def update(self):
-        # only continue with updates until all ranks finish
-        self.wait_for_everyone()

        if exists(self.max_grad_norm):
-            self.scaler.unscale_(self.optimizer)
-            # utilize HFA clipping where applicable
-            self.clip_grad_norm_(self.diffusion_prior.parameters(), self.max_grad_norm)
-
-        self.scaler.step(self.optimizer)
-        self.scaler.update()
+            self.accelerator.clip_grad_norm_(self.diffusion_prior.parameters(), self.max_grad_norm)
+        
+        self.optimizer.step()
        self.optimizer.zero_grad()

+        # accelerator will occasionally skip optimizer steps in a "dynamic loss scaling strategy"
+        if not self.accelerator.optimizer_step_was_skipped:
+            with self.warmup_scheduler.dampening():
+                self.scheduler.step()
+
        if self.use_ema:
            self.ema_diffusion_prior.update()

@@ -376,7 +381,7 @@ class DiffusionPriorTrainer(nn.Module):
    @cast_torch_tensor
    @prior_sample_in_chunks
    def embed_text(self, *args, **kwargs):
-        return self.unwrap_model(self.diffusion_prior).clip.embed_text(*args, **kwargs)
+        return self.accelerator.unwrap_model(self.diffusion_prior).clip.embed_text(*args, **kwargs)

    @cast_torch_tensor
    def forward(
@@ -388,16 +393,14 @@ class DiffusionPriorTrainer(nn.Module):
        total_loss = 0.

        for chunk_size_frac, (chunked_args, chunked_kwargs) in split_args_and_kwargs(*args, split_size = max_batch_size, **kwargs):
-            with autocast(enabled = self.amp):
+            with self.accelerator.autocast():
                loss = self.diffusion_prior(*chunked_args, **chunked_kwargs)
                loss = loss * chunk_size_frac

            total_loss += loss.item()

-            # backprop with accelerate if applicable
-
            if self.training:
-                self.backprop(self.scaler.scale(loss))
+                self.accelerator.backward(loss)

        return total_loss

@@ -424,10 +427,12 @@ class DecoderTrainer(nn.Module):
        self,
        decoder,
        accelerator = None,
+        dataloaders = None,
        use_ema = True,
        lr = 1e-4,
        wd = 1e-2,
        eps = 1e-8,
+        warmup_steps = None,
        max_grad_norm = 0.5,
        amp = False,
        group_wd_params = True,
@@ -449,21 +454,36 @@ class DecoderTrainer(nn.Module):
        # be able to finely customize learning rate, weight decay
        # per unet

-        lr, wd, eps = map(partial(cast_tuple, length = self.num_unets), (lr, wd, eps))
+        lr, wd, eps, warmup_steps = map(partial(cast_tuple, length = self.num_unets), (lr, wd, eps, warmup_steps))
+
+        assert all([unet_lr <= 1e-2 for unet_lr in lr]), 'your learning rate is too high, recommend sticking with 1e-4, at most 5e-4'

        optimizers = []
+        schedulers = []
+        warmup_schedulers = []

-        for unet, unet_lr, unet_wd, unet_eps in zip(decoder.unets, lr, wd, eps):
-            optimizer = get_optimizer(
-                unet.parameters(),
-                lr = unet_lr,
-                wd = unet_wd,
-                eps = unet_eps,
-                group_wd_params = group_wd_params,
-                **kwargs
-            )
+        for unet, unet_lr, unet_wd, unet_eps, unet_warmup_steps in zip(decoder.unets, lr, wd, eps, warmup_steps):
+            if isinstance(unet, nn.Identity):
+                optimizers.append(None)
+                schedulers.append(None)
+                warmup_schedulers.append(None)
+            else:
+                optimizer = get_optimizer(
+                    unet.parameters(),
+                    lr = unet_lr,
+                    wd = unet_wd,
+                    eps = unet_eps,
+                    group_wd_params = group_wd_params,
+                    **kwargs
+                )

-            optimizers.append(optimizer)
+                optimizers.append(optimizer)
+                scheduler = LambdaLR(optimizer, lr_lambda = lambda step: 1.0)
+
+                warmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period = unet_warmup_steps) if exists(unet_warmup_steps) else None
+                warmup_schedulers.append(warmup_scheduler)
+
+                schedulers.append(scheduler)

            if self.use_ema:
                self.ema_unets.append(EMA(unet, **ema_kwargs))
@@ -472,15 +492,58 @@ class DecoderTrainer(nn.Module):

        self.max_grad_norm = max_grad_norm

-        self.register_buffer('step', torch.tensor([0.]))
+        self.register_buffer('steps', torch.tensor([0] * self.num_unets))
+
+        if self.accelerator.distributed_type == DistributedType.DEEPSPEED and decoder.clip is not None:
+            # Then we need to make sure clip is using the correct precision or else deepspeed will error
+            cast_type_map = {
+                "fp16": torch.half,
+                "bf16": torch.bfloat16,
+                "no": torch.float
+            }
+            precision_type = cast_type_map[accelerator.mixed_precision]
+            assert precision_type == torch.float, "DeepSpeed currently only supports float32 precision when using on the fly embedding generation from clip"
+            clip = decoder.clip
+            clip.to(precision_type)

        decoder, *optimizers = list(self.accelerator.prepare(decoder, *optimizers))

        self.decoder = decoder

+        # prepare dataloaders
+
+        train_loader = val_loader = None
+        if exists(dataloaders):
+            train_loader, val_loader = self.accelerator.prepare(dataloaders["train"], dataloaders["val"])
+
+        self.train_loader = train_loader
+        self.val_loader = val_loader
+
+        # store optimizers
+
        for opt_ind, optimizer in zip(range(len(optimizers)), optimizers):
            setattr(self, f'optim{opt_ind}', optimizer)

+        # store schedulers
+
+        for sched_ind, scheduler in zip(range(len(schedulers)), schedulers):
+            setattr(self, f'sched{sched_ind}', scheduler)
+
+        # store warmup schedulers
+
+        self.warmup_schedulers = warmup_schedulers
+
+    def validate_and_return_unet_number(self, unet_number = None):
+        if self.num_unets == 1:
+            unet_number = default(unet_number, 1)
+
+        assert exists(unet_number) and 1 <= unet_number <= self.num_unets
+        return unet_number
+
+    def num_steps_taken(self, unet_number = None):
+        unet_number = self.validate_and_return_unet_number(unet_number)
+        return self.steps[unet_number - 1].item()
+
    def save(self, path, overwrite = True, **kwargs):
        path = Path(path)
        assert not (path.exists() and not overwrite)
@@ -489,44 +552,53 @@ class DecoderTrainer(nn.Module):
        save_obj = dict(
            model = self.accelerator.unwrap_model(self.decoder).state_dict(),
            version = __version__,
-            step = self.step.item(),
+            steps = self.steps.cpu(),
            **kwargs
        )

        for ind in range(0, self.num_unets):
            optimizer_key = f'optim{ind}'
            optimizer = getattr(self, optimizer_key)
-            save_obj = {**save_obj, optimizer_key: self.accelerator.unwrap_model(optimizer).state_dict()}
+            state_dict = optimizer.state_dict() if optimizer is not None else None
+            save_obj = {**save_obj, optimizer_key: state_dict}

        if self.use_ema:
            save_obj = {**save_obj, 'ema': self.ema_unets.state_dict()}

        self.accelerator.save(save_obj, str(path))

+    def load_state_dict(self, loaded_obj, only_model = False, strict = True):
+        if version.parse(__version__) != version.parse(loaded_obj['version']):
+            self.accelerator.print(f'loading saved decoder at version {loaded_obj["version"]}, but current package version is {__version__}')
+
+        self.accelerator.unwrap_model(self.decoder).load_state_dict(loaded_obj['model'], strict = strict)
+        self.steps.copy_(loaded_obj['steps'])
+
+        if only_model:
+            return loaded_obj
+
+        for ind, last_step in zip(range(0, self.num_unets), self.steps.tolist()):
+
+            optimizer_key = f'optim{ind}'
+            optimizer = getattr(self, optimizer_key)
+            warmup_scheduler = self.warmup_schedulers[ind]
+            if optimizer is not None:
+                optimizer.load_state_dict(loaded_obj[optimizer_key])
+
+            if exists(warmup_scheduler):
+                warmup_scheduler.last_step = last_step
+
+        if self.use_ema:
+            assert 'ema' in loaded_obj
+            self.ema_unets.load_state_dict(loaded_obj['ema'], strict = strict)
+
    def load(self, path, only_model = False, strict = True):
        path = Path(path)
        assert path.exists()

        loaded_obj = torch.load(str(path), map_location = 'cpu')

-        if version.parse(__version__) != version.parse(loaded_obj['version']):
-            self.accelerator.print(f'loading saved decoder at version {loaded_obj["version"]}, but current package version is {__version__}')
-
-        self.accelerator.unwrap_model(self.decoder).load_state_dict(loaded_obj['model'], strict = strict)
-        self.step.copy_(torch.ones_like(self.step) * loaded_obj['step'])
-
-        if only_model:
-            return loaded_obj
-
-        for ind in range(0, self.num_unets):
-            optimizer_key = f'optim{ind}'
-            optimizer = getattr(self, optimizer_key)
-
-            self.accelerator.unwrap_model(optimizer).load_state_dict(loaded_obj[optimizer_key])
-
-        if self.use_ema:
-            assert 'ema' in loaded_obj
-            self.ema_unets.load_state_dict(loaded_obj['ema'], strict = strict)
+        self.load_state_dict(loaded_obj, only_model = only_model, strict = strict)

        return loaded_obj

@@ -534,25 +606,36 @@ class DecoderTrainer(nn.Module):
    def unets(self):
        return nn.ModuleList([ema.ema_model for ema in self.ema_unets])

-    def update(self, unet_number = None):
-        if self.num_unets == 1:
-            unet_number = default(unet_number, 1)
+    def increment_step(self, unet_number):
+        assert 1 <= unet_number <= self.num_unets

-        assert exists(unet_number) and 1 <= unet_number <= self.num_unets
+        unet_index_tensor = torch.tensor(unet_number - 1, device = self.steps.device)
+        self.steps += F.one_hot(unet_index_tensor, num_classes = len(self.steps))
+
+    def update(self, unet_number = None):
+        unet_number = self.validate_and_return_unet_number(unet_number)
        index = unet_number - 1

        optimizer = getattr(self, f'optim{index}')
+        scheduler = getattr(self, f'sched{index}')

        if exists(self.max_grad_norm):
            self.accelerator.clip_grad_norm_(self.decoder.parameters(), self.max_grad_norm)  # Automatically unscales gradients
+
        optimizer.step()
        optimizer.zero_grad()

+        warmup_scheduler = self.warmup_schedulers[index]
+        scheduler_context = warmup_scheduler.dampening if exists(warmup_scheduler) else nullcontext
+
+        with scheduler_context():
+            scheduler.step()
+
        if self.use_ema:
            ema_unet = self.ema_unets[index]
            ema_unet.update()

-        self.step += 1
+        self.increment_step(unet_number)

    @torch.no_grad()
    @cast_torch_tensor
@@ -560,8 +643,14 @@ class DecoderTrainer(nn.Module):
    def sample(self, *args, **kwargs):
        distributed = self.accelerator.num_processes > 1
        base_decoder = self.accelerator.unwrap_model(self.decoder)
+
+        was_training = base_decoder.training
+        base_decoder.eval()
+
        if kwargs.pop('use_non_ema', False) or not self.use_ema:
-            return base_decoder.sample(*args, **kwargs, distributed = distributed)
+            out = base_decoder.sample(*args, **kwargs, distributed = distributed)
+            base_decoder.train(was_training)
+            return out

        trainable_unets = self.accelerator.unwrap_model(self.decoder).unets
        base_decoder.unets = self.unets                  # swap in exponential moving averaged unets for sampling
@@ -574,30 +663,53 @@ class DecoderTrainer(nn.Module):
        for ema in self.ema_unets:
            ema.restore_ema_model_device()

+        base_decoder.train(was_training)
        return output

+    @torch.no_grad()
+    @cast_torch_tensor
+    @prior_sample_in_chunks
+    def embed_text(self, *args, **kwargs):
+        return self.accelerator.unwrap_model(self.decoder).clip.embed_text(*args, **kwargs)
+
+    @torch.no_grad()
+    @cast_torch_tensor
+    @prior_sample_in_chunks
+    def embed_image(self, *args, **kwargs):
+        return self.accelerator.unwrap_model(self.decoder).clip.embed_image(*args, **kwargs)
+
    @cast_torch_tensor
    def forward(
        self,
        *args,
        unet_number = None,
        max_batch_size = None,
+        return_lowres_cond_image=False,
        **kwargs
    ):
-        if self.num_unets == 1:
-            unet_number = default(unet_number, 1)
+        unet_number = self.validate_and_return_unet_number(unet_number)

        total_loss = 0.
-
+        cond_images = []
        for chunk_size_frac, (chunked_args, chunked_kwargs) in split_args_and_kwargs(*args, split_size = max_batch_size, **kwargs):
-            # with autocast(enabled = self.amp):
            with self.accelerator.autocast():
-                loss = self.decoder(*chunked_args, unet_number = unet_number, **chunked_kwargs)
+                loss_obj = self.decoder(*chunked_args, unet_number = unet_number, return_lowres_cond_image=return_lowres_cond_image, **chunked_kwargs)
+                # loss_obj may be a tuple with loss and cond_image
+                if return_lowres_cond_image:
+                    loss, cond_image = loss_obj
+                else:
+                    loss = loss_obj
+                    cond_image = None
                loss = loss * chunk_size_frac
+                if cond_image is not None:
+                    cond_images.append(cond_image)

            total_loss += loss.item()

            if self.training:
                self.accelerator.backward(loss)

-        return total_loss
+        if return_lowres_cond_image:
+            return total_loss, torch.stack(cond_images)
+        else:
+            return total_loss
--- a/dalle2_pytorch/utils.py
+++ b/dalle2_pytorch/utils.py
@@ -1,6 +1,11 @@
 import time
 import importlib

+# helper functions
+
+def exists(val):
+    return val is not None
+
 # time helpers

 class Timer:
--- a/dalle2_pytorch/version.py
+++ b/dalle2_pytorch/version.py
@@ -1 +1 @@
-__version__ = '0.11.4'
+__version__ = '1.6.2'
--- a/prior.md
+++ b/prior.md
@@ -0,0 +1,183 @@
+# Diffusion Prior
+This readme serves as an introduction to the diffusion prior.
+
+## Intro
+
+A properly trained prior will allow you to translate between two embedding spaces. If you know *a priori* that two embeddings are connected in some way, then the ability to translate between them could be extremely helpful.
+
+### Motivation
+
+Before we dive into the model, let’s look at a quick example of where the model may be helpful.
+
+For demonstration purposes we will imagine that we wish to generate images from text using CLIP and a Decoder.
+
+> [CLIP](https://openai.com/blog/clip/) is a contrastive model that learns to maximize the cosine similarity between a given image and caption. However, there is no guarantee that these embeddings are in the same space. While the embeddings generated are ***close***, the image and text embeddings occupy two disjoint sets.
+
+```python
+# Load Models
+clip_model = clip.load("ViT-L/14")
+decoder = Decoder(checkpoint="best.pth") # A decoder trained on CLIP Image embeddings
+
+# Retrieve prompt from user and encode with CLIP
+prompt = "A corgi wearing sunglasses"
+tokenized_text = tokenize(prompt)
+text_embedding = clip_model.encode_text(tokenized_text)
+
+# Now, pass the text embedding to the decoder
+predicted_image = decoder.sample(text_embedding)
+```
+
+> **Question**: *Can you spot the issue here?*
+>
+> **Answer**: *We’re trying to generate an image from a text embedding!*
+
+Unfortunately, we run into the issue previously mentioned: the image embeddings and the text embeddings are not interchangeable! Now let's look at a better solution:
+
+```python
+# Load Models
+prior = Prior(checkpoint="prior.pth") # A prior trained to go from: text -> clip text emb -> clip img emb
+decoder = Decoder(checkpoint="decoder.pth") # A decoder trained on CLIP Image embeddings
+
+# Retrieve prompt from user and encode with a prior
+prompt = "A corgi wearing sunglasses"
+tokenized_text = tokenize(prompt)
+image_embedding = prior.sample(tokenized_text) # <-- now we get an embedding in the same space as images!
+
+# Now, pass the predicted image embedding to the decoder
+predicted_image = decoder.sample(image_embedding)
+```
+
+With the prior we are able to successfully generate embeddings *within* CLIP's image space! For this reason, the decoder will perform much better as it receives input that is much closer to its training data.
+
+> **You may be asking yourself the following question:**
+>
+> *"Why don't you just train the decoder on clip text embeddings instead of image embeddings?"*
+>
+> OpenAI covers this topic in their [DALLE-2 paper](https://arxiv.org/abs/2204.06125). The TL;DR is *"it doesn't work as well as decoders trained on image embeddings"*. Also, it's just an example :smile:
+
+## Usage
+
+Utilizing a pre-trained prior is quite simple.
+
+### Loading Checkpoints
+```python
+import torch
+from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork, OpenAIClipAdapter
+from dalle2_pytorch.trainer import DiffusionPriorTrainer
+
+def load_diffusion_model(dprior_path, device):
+
+    prior_network = DiffusionPriorNetwork(
+        dim=768,
+        depth=24,
+        dim_head=64,
+        heads=32,
+        normformer=True,
+        attn_dropout=5e-2,
+        ff_dropout=5e-2,
+        num_time_embeds=1,
+        num_image_embeds=1,
+        num_text_embeds=1,
+        num_timesteps=1000,
+        ff_mult=4
+    )
+
+    diffusion_prior = DiffusionPrior(
+        net=prior_network,
+        clip=OpenAIClipAdapter("ViT-L/14"),
+        image_embed_dim=768,
+        timesteps=1000,
+        cond_drop_prob=0.1,
+        loss_type="l2",
+        condition_on_text_encodings=True,
+
+    )
+
+    trainer = DiffusionPriorTrainer(
+        diffusion_prior=diffusion_prior,
+        lr=1.1e-4,
+        wd=6.02e-2,
+        max_grad_norm=0.5,
+        amp=False,
+        group_wd_params=True,
+        use_ema=True,
+        device=device,
+        accelerator=None,
+    )
+
+    trainer.load(dprior_path)
+
+    return trainer
+```
+
+Here we instantiate a model that matches the configuration it was trained with, and then load the weights (*just like any other PyTorch model!*).
+
+### Sampling
+Once we have a pre-trained model, generating embeddings is quite simple!
+```python
+# tokenize the text
+tokenized_text = clip.tokenize("<your amazing prompt>")
+# predict an embedding
+predicted_embedding = prior.sample(tokenized_text, num_samples_per_batch=2, cond_scale=1.0)
+```
+
+The resulting tensor returned from `.sample()` is of the same shape as your training data along the non-batch dimension(s). For example, a prior trained on `ViT-L/14` embeddings will predict an embedding of shape (1, 768).
+
+> For CLIP priors, this is quite handy as it means that you can use `prior.sample(tokenized_text)` as a drop-in replacement for `clip.encode_text()`.
+
+**Some things to note:**
+* It is possible to specify the number of embeddings to sample (the default suggested by OpenAI is `n=2`). Put simply, the idea is to avoid getting unlucky with a bad embedding generation by creating two and selecting the one with the higher cosine similarity with the prompt.
+* You may specify a higher conditioning scale than the default (`1.0`). It is unclear whether OpenAI uses a higher value for the prior specifically, or only on the decoder. Local testing has shown poor results with anything higher than `1.0` but *ymmv*.
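
The re-ranking idea from the first note can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `cosine_similarity` and `select_best_candidate` are hypothetical helpers, and the toy 3-dimensional vectors stand in for real CLIP embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_best_candidate(text_embedding, candidates):
    """Return (index, candidate) for the candidate closest to the text embedding."""
    scores = [cosine_similarity(text_embedding, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, candidates[best]

# Toy 3-dimensional embeddings; real ViT-L/14 embeddings have 768 dimensions.
text_embedding = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the prompt
    [0.9, 0.1, 0.0],  # nearly aligned with the prompt
]

best_index, best_embedding = select_best_candidate(text_embedding, candidates)
print(best_index)
```

With tensors, the same selection would normally be done with batched operations rather than Python loops; the loop form is used here only to keep the logic visible.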
+
+---
+
+## Training
+
+### Overview
+
+Training the prior is a relatively straightforward process thanks to the Trainer base class. The major step that is required of you is preparing a dataset in the format that `EmbeddingReader` expects. Having pre-computed embeddings massively increases training efficiency and is generally recommended, as you will likely benefit from having them on hand for other tasks as well. Once you have a dataset, you are ready to move on to configuration.
+
+## Dataset
+
+To train the prior, it is highly recommended to use precomputed embeddings for the images. To obtain these for a custom dataset, you can leverage [img2dataset](https://github.com/rom1504/img2dataset) to pull images from a list of URLs and [clip_retrieval](https://github.com/rom1504/clip-retrieval#clip-inference) to generate the actual embeddings that can be used in the prior's dataloader.
+
+## Configuration
+
+The configuration file allows you to easily track and reproduce experiments. It is a simple JSON file that specifies the architecture, dataset, and training parameters. For more information and specifics, please see the configuration README.
+
+## Distributed Training
+
+If you would like to train in a distributed manner, we have opted to leverage Hugging Face's new Accelerate library (HFA). HFA makes it extremely simple to distribute work across multiple GPUs and nodes. All that is required of you is to follow the simple CLI configuration tool ([more information here](https://huggingface.co/docs/accelerate/accelerator)).
+
+## Evaluation
+
+There are a variety of metrics available to you when training the prior. You can read a brief description of each in the table below:
+| Metric                              | Description                                                                                                                                                                                                                                                  | Comments                                                                                                                                                                                                                                                                                                                                                |
+| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Online Model Validation             | The validation loss associated with your online model.                                                                                                                                                                                                       | Ideally validation loss will be as low as possible. Using L2 loss, values as low as `0.1` and lower are possible after around 1 Billion samples seen.                                                                                                                                                                                                |
+| EMA Validation                      | This metric measures the validation loss associated with your EMA model.                                                                                                                                                                                     | This will likely lag behind your "online" model's validation loss, but should outperform in the long-term.                                                                                                                                                                                                                                              |
+| Baseline Similarity                 | Baseline similarity refers to the similarity between your dataset's prompts and associated image embeddings. This will serve as a guide for your prior's performance in cosine similarity.                                                                    | Generally `0.3` is considered a good cosine similarity for caption similarity.                                                                                                                                                                                                                                                                         |
+| Similarity With Original Image      | This metric will measure the cosine similarity between your prior's predicted image embedding and the actual image that the caption was associated with. This is useful for determining whether your prior is generating images with the right contents.     | Values around `0.75`+ are obtainable. This metric should improve rapidly in the early stages of training and plateau with diminishing increases over time. If it takes hundreds of millions of samples to reach above `0.5`/`0.6` similarity, then you are likely suffering from some kind of training error or inefficiency (e.g. not using EMA). |
+| Difference From Baseline Similarity | Sometimes it's useful to visualize a metric in another light. This metric will show you how your prior's predicted image embeddings match up with the baseline similarity measured in your dataset.                                                          | This value should float around `0.0` with some room for variation. After a billion samples seen, values are within +/- `0.01` of `0.0`. If this climbs too high (~>`0.02`), then this may be a sign that your model is overfitting somehow.                                                                                                        |
+| Similarity With Text                | This metric is your bread-and-butter cosine similarity between the predicted image embedding and the original caption given to the prior. Monitoring this metric will be one of your main focuses and is probably the second most important behind your loss. | As mentioned, this value should be close to baseline similarity. We have observed early rapid increase with diminishing returns as the prior learns to generate valid image embeddings. If this value increases too far beyond the baseline similarity, it could be an indication that your model is overfitting.                                        |
+| Similarity With Unrelated Caption   | This metric will attempt to expose an overfit prior by feeding it arbitrary prompts (from your dataset) and then measuring the similarity of this predicted embedding with some other image.                                                                  | Early on we found that a poorly trained/modeled prior could effectively fool CLIP into believing that the cosine similarity between two images was high (when in fact the caption and image were completely unrelated). With this in mind, a low value is ideal; anything below `0.1` is probably safe.                                               |
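
To make the relationships between the similarity metrics in the table concrete, here is a rough sketch of how they could be computed from already-available embeddings. It is an assumption-laden illustration rather than the training script's actual code: `prior_similarity_report` and its argument names are hypothetical, and the embeddings are toy 2-dimensional lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_similarity(lhs, rhs):
    """Average pairwise cosine similarity over two aligned lists of vectors."""
    scores = [cosine_similarity(a, b) for a, b in zip(lhs, rhs)]
    return sum(scores) / len(scores)

def prior_similarity_report(text_embeds, image_embeds, predicted_embeds, unrelated_predicted_embeds):
    """Compute the table's similarity metrics (hypothetical helper).

    predicted_embeds: prior outputs for the matching captions.
    unrelated_predicted_embeds: prior outputs for captions that do NOT
    belong to the images in image_embeds.
    """
    baseline = mean_similarity(text_embeds, image_embeds)
    with_text = mean_similarity(predicted_embeds, text_embeds)
    return {
        "baseline_similarity": baseline,
        "similarity_with_text": with_text,
        "difference_from_baseline": with_text - baseline,
        "similarity_with_original_image": mean_similarity(predicted_embeds, image_embeds),
        "similarity_with_unrelated_caption": mean_similarity(unrelated_predicted_embeds, image_embeds),
    }

# Toy data: one caption/image pair and a "perfect" prediction.
report = prior_similarity_report(
    text_embeds=[[1.0, 0.0]],
    image_embeds=[[1.0, 1.0]],
    predicted_embeds=[[1.0, 1.0]],
    unrelated_predicted_embeds=[[-1.0, 1.0]],
)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the prediction coincides with the true image embedding, so the difference from baseline is zero and the unrelated-caption similarity is zero as well; real runs will only approach these values.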
+
+## Launching the script
+
+Now that you’ve done all the prep it’s time for the easy part! 🚀
+
+To launch the script, use either `accelerate launch train_diffusion_prior.py --config_path <path to your config>` for distributed training with Hugging Face Accelerate, or `python train_diffusion_prior.py` if you would like to train on your GPU/CPU without Accelerate.
+
+## Checkpointing
+
+Checkpoints will be saved to the directory specified in your configuration file.
+
+Additionally, a final checkpoint is saved before running the test split. This file will be saved to the same directory and titled “latest.pth”. This is to avoid problems where your `save_every` configuration does not overlap with the number of steps required to do a complete pass through the data.
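
The reason for the extra final checkpoint is simple arithmetic, sketched below. The helper names and the step counts are made up for illustration, and the exact saving logic of the training script is assumed rather than quoted.

```python
def checkpoint_steps(total_steps, save_every):
    """Steps at which a periodic checkpoint would be written."""
    return list(range(save_every, total_steps + 1, save_every))

def steps_not_covered(total_steps, save_every):
    """Number of trailing steps that no periodic checkpoint captures."""
    periodic = checkpoint_steps(total_steps, save_every)
    last_saved = periodic[-1] if periodic else 0
    return total_steps - last_saved

# Hypothetical run: one pass over the data takes 1050 steps, saving every 200.
total_steps = 1050
save_every = 200

print(checkpoint_steps(total_steps, save_every))   # [200, 400, 600, 800, 1000]
print(steps_not_covered(total_steps, save_every))  # 50
```

Without a final save, the last 50 steps of this hypothetical run would be lost; a final `latest.pth` closes that gap.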
+
+## Things To Keep In Mind
+
+The prior has not been trained for tasks other than the traditional CLIP embedding translation, at least not yet.
+
+As we finalize the replication of unCLIP, there will almost assuredly be experiments attempting to apply the prior network to other tasks.
+
+With that in mind, you are more or less a pioneer in embedding-translation if you are reading this and attempting something you don’t see documentation for!
--- a/setup.py
+++ b/setup.py
@@ -26,7 +26,7 @@ setup(
  install_requires=[
    'accelerate',
    'click',
-    'clip-anytorch',
+    'clip-anytorch>=2.4.0',
    'coca-pytorch>=0.0.5',
    'ema-pytorch>=0.0.7',
    'einops>=0.4',
@@ -37,6 +37,7 @@ setup(
    'packaging',
    'pillow',
    'pydantic',
+    'pytorch-warmup',
    'resize-right>=0.0.2',
    'rotary-embedding-torch',
    'torch>=1.10',
--- a/test_data/0.tar
+++ b/test_data/0.tar
--- a/test_data/1.tar
+++ b/test_data/1.tar
--- a/test_data/2.tar
+++ b/test_data/2.tar
--- a/test_data/3.tar
+++ b/test_data/3.tar
--- a/test_data/4.tar
+++ b/test_data/4.tar
--- a/test_data/5.tar
+++ b/test_data/5.tar
--- a/test_data/6.tar
+++ b/test_data/6.tar
--- a/test_data/7.tar
+++ b/test_data/7.tar
--- a/test_data/8.tar
+++ b/test_data/8.tar
--- a/test_data/9.tar
+++ b/test_data/9.tar
--- a/train_decoder.py
+++ b/train_decoder.py
@@ -1,19 +1,23 @@
 from pathlib import Path
+from typing import List
+from datetime import timedelta

 from dalle2_pytorch.trainer import DecoderTrainer
 from dalle2_pytorch.dataloaders import create_image_embedding_dataloader
-from dalle2_pytorch.trackers import WandbTracker, ConsoleTracker, DummyTracker
-from dalle2_pytorch.train_configs import TrainDecoderConfig
+from dalle2_pytorch.trackers import Tracker
+from dalle2_pytorch.train_configs import DecoderConfig, TrainDecoderConfig
 from dalle2_pytorch.utils import Timer, print_ribbon
-from dalle2_pytorch.dalle2_pytorch import resize_image_to
+from dalle2_pytorch.dalle2_pytorch import Decoder, resize_image_to
+from clip import tokenize

 import torchvision
 import torch
+from torch import nn
 from torchmetrics.image.fid import FrechetInceptionDistance
 from torchmetrics.image.inception import InceptionScore
 from torchmetrics.image.kid import KernelInceptionDistance
 from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
-from accelerate import Accelerator, DistributedDataParallelKwargs
+from accelerate import Accelerator, DistributedDataParallelKwargs, InitProcessGroupKwargs
 from accelerate.utils import dataclasses as accelerate_dataclasses
 import webdataset as wds
 import click
@@ -33,7 +37,8 @@ def exists(val):
 def create_dataloaders(
    available_shards,
    webdataset_base_url,
-    embeddings_url,
+    img_embeddings_url=None,
+    text_embeddings_url=None,
    shard_width=6,
    num_workers=4,
    batch_size=32,
@@ -63,14 +68,15 @@ def create_dataloaders(
    test_urls = [webdataset_base_url.format(str(shard).zfill(shard_width)) for shard in test_split]
    val_urls = [webdataset_base_url.format(str(shard).zfill(shard_width)) for shard in val_split]
    
-    create_dataloader = lambda tar_urls, shuffle=False, resample=False, with_text=False, for_sampling=False: create_image_embedding_dataloader(
+    create_dataloader = lambda tar_urls, shuffle=False, resample=False, for_sampling=False: create_image_embedding_dataloader(
        tar_url=tar_urls,
        num_workers=num_workers,
        batch_size=batch_size if not for_sampling else n_sample_images,
-        embeddings_url=embeddings_url,
+        img_embeddings_url=img_embeddings_url,
+        text_embeddings_url=text_embeddings_url,
        index_width=index_width,
        shuffle_num = None,
-        extra_keys= ["txt"] if with_text else [],
+        extra_keys= ["txt"],
        shuffle_shards = shuffle,
        resample_shards = resample, 
        img_preproc=img_preproc,
@@ -79,8 +85,8 @@ def create_dataloaders(

    train_dataloader = create_dataloader(train_urls, shuffle=shuffle_train, resample=resample_train)
    train_sampling_dataloader = create_dataloader(train_urls, shuffle=False, for_sampling=True)
-    val_dataloader = create_dataloader(val_urls, shuffle=False, with_text=True)
-    test_dataloader = create_dataloader(test_urls, shuffle=False, with_text=True)
+    val_dataloader = create_dataloader(val_urls, shuffle=False)
+    test_dataloader = create_dataloader(test_urls, shuffle=False)
    test_sampling_dataloader = create_dataloader(test_urls, shuffle=False, for_sampling=True)
    return {
        "train": train_dataloader,
@@ -104,54 +110,79 @@ def get_example_data(dataloader, device, n=5):
    Samples the dataloader and returns a zipped list of examples
    """
    images = []
-    embeddings = []
+    img_embeddings = []
+    text_embeddings = []
    captions = []
-    dataset_keys = get_dataset_keys(dataloader)
-    has_caption = "txt" in dataset_keys
-    for data in dataloader:
-        if has_caption:
-            img, emb, txt = data
+    for img, emb, txt in dataloader:
+        img_emb, text_emb = emb.get('img'), emb.get('text')
+        if img_emb is not None:
+            img_emb = img_emb.to(device=device, dtype=torch.float)
+            img_embeddings.extend(list(img_emb))
        else:
-            img, emb = data
-            txt = [""] * emb.shape[0]
+            # Then we add None img.shape[0] times
+            img_embeddings.extend([None]*img.shape[0])
+        if text_emb is not None:
+            text_emb = text_emb.to(device=device, dtype=torch.float)
+            text_embeddings.extend(list(text_emb))
+        else:
+            # Then we add None img.shape[0] times
+            text_embeddings.extend([None]*img.shape[0])
        img = img.to(device=device, dtype=torch.float)
-        emb = emb.to(device=device, dtype=torch.float)
        images.extend(list(img))
-        embeddings.extend(list(emb))
        captions.extend(list(txt))
        if len(images) >= n:
            break
-    return list(zip(images[:n], embeddings[:n], captions[:n]))
+    return list(zip(images[:n], img_embeddings[:n], text_embeddings[:n], captions[:n]))

-def generate_samples(trainer, example_data, text_prepend=""):
+def generate_samples(trainer, example_data, start_unet=1, end_unet=None, condition_on_text_encodings=False, cond_scale=1.0, device=None, text_prepend="", match_image_size=True):
    """
    Takes example data and generates images from the embeddings
    Returns three lists: real images, generated images, and captions
    """
-    real_images, embeddings, txts = zip(*example_data)
-    embeddings_tensor = torch.stack(embeddings)
-    samples = trainer.sample(embeddings_tensor)
+    real_images, img_embeddings, text_embeddings, txts = zip(*example_data)
+    sample_params = {}
+    if img_embeddings[0] is None:
+        # Generate image embeddings from clip
+        imgs_tensor = torch.stack(real_images)
+        img_embeddings, *_ = trainer.embed_image(imgs_tensor)
+        sample_params["image_embed"] = img_embeddings
+    else:
+        # Then we are using precomputed image embeddings
+        img_embeddings = torch.stack(img_embeddings)
+        sample_params["image_embed"] = img_embeddings
+    if condition_on_text_encodings:
+        if text_embeddings[0] is None:
+            # Generate text embeddings from text
+            tokenized_texts = tokenize(txts, truncate=True)
+            sample_params["text"] = tokenized_texts
+        else:
+            # Then we are using precomputed text embeddings
+            text_embeddings = torch.stack(text_embeddings)
+            sample_params["text_encodings"] = text_embeddings
+    sample_params["start_at_unet_number"] = start_unet
+    sample_params["stop_at_unet_number"] = end_unet
+    if start_unet > 1:
+        # If we are only training upsamplers
+        sample_params["image"] = torch.stack(real_images)
+    if device is not None:
+        sample_params["_device"] = device
+    samples = trainer.sample(**sample_params)
    generated_images = list(samples)
    captions = [text_prepend + txt for txt in txts]
+    if match_image_size:
+        generated_image_size = generated_images[0].shape[-1]
+        real_images = [resize_image_to(image, generated_image_size, clamp_range=(0, 1)) for image in real_images]
    return real_images, generated_images, captions

-def generate_grid_samples(trainer, examples, text_prepend=""):
+def generate_grid_samples(trainer, examples, start_unet=1, end_unet=None, condition_on_text_encodings=False, cond_scale=1.0, device=None, text_prepend=""):
    """
    Generates samples and uses torchvision to put them in a side by side grid for easy viewing
    """
-    real_images, generated_images, captions = generate_samples(trainer, examples, text_prepend)
-
-    real_image_size = real_images[0].shape[-1]
-    generated_image_size = generated_images[0].shape[-1]
-
-    # training images may be larger than the generated one
-    if real_image_size > generated_image_size:
-        real_images = [resize_image_to(image, generated_image_size) for image in real_images]
-
+    real_images, generated_images, captions = generate_samples(trainer, examples, start_unet, end_unet, condition_on_text_encodings, cond_scale, device, text_prepend)
    grid_images = [torchvision.utils.make_grid([original_image, generated_image]) for original_image, generated_image in zip(real_images, generated_images)]
    return grid_images, captions
                    
-def evaluate_trainer(trainer, dataloader, device, n_evaluation_samples=1000, FID=None, IS=None, KID=None, LPIPS=None):
+def evaluate_trainer(trainer, dataloader, device,  start_unet, end_unet, condition_on_text_encodings=False, cond_scale=1.0, inference_device=None, n_evaluation_samples=1000, FID=None, IS=None, KID=None, LPIPS=None):
    """
    Computes evaluation metrics for the decoder
    """
@@ -161,7 +192,7 @@ def evaluate_trainer(trainer, dataloader, device, n_evaluation_samples=1000, FID
    if len(examples) == 0:
        print("No data to evaluate. Check that your dataloader has shards.")
        return metrics
-    real_images, generated_images, captions = generate_samples(trainer, examples)
+    real_images, generated_images, captions = generate_samples(trainer, examples, start_unet, end_unet, condition_on_text_encodings, cond_scale, inference_device)
    real_images = torch.stack(real_images).to(device=device, dtype=torch.float)
    generated_images = torch.stack(generated_images).to(device=device, dtype=torch.float)
    # Convert from [0, 1] to [0, 255] and from torch.float to torch.uint8
@@ -213,43 +244,37 @@ def evaluate_trainer(trainer, dataloader, device, n_evaluation_samples=1000, FID
            metrics[metric_name] = metrics_tensor[i].item()
    return metrics

-def save_trainer(tracker, trainer, epoch, sample, next_task, validation_losses, relative_paths):
+def save_trainer(tracker: Tracker, trainer: DecoderTrainer, epoch: int, sample: int, next_task: str, validation_losses: List[float], samples_seen: int, is_latest=True, is_best=False):
    """
    Logs the model with an appropriate method depending on the tracker
    """
-    if isinstance(relative_paths, str):
-        relative_paths = [relative_paths]
-    for relative_path in relative_paths:
-        local_path = str(tracker.data_path / relative_path)
-        trainer.save(local_path, epoch=epoch, sample=sample, next_task=next_task, validation_losses=validation_losses)
-        tracker.save_file(local_path)
+    tracker.save(trainer, is_best=is_best, is_latest=is_latest, epoch=epoch, sample=sample, next_task=next_task, validation_losses=validation_losses, samples_seen=samples_seen)
    
-def recall_trainer(tracker, trainer, recall_source=None, **load_config):
+def recall_trainer(tracker: Tracker, trainer: DecoderTrainer):
    """
    Loads the model with an appropriate method depending on the tracker
    """
-    trainer.accelerator.print(print_ribbon(f"Loading model from {recall_source}"))
-    local_filepath = tracker.recall_file(recall_source, **load_config)
-    state_dict = trainer.load(local_filepath)
-    return state_dict.get("epoch", 0), state_dict.get("validation_losses", []), state_dict.get("next_task", "train"), state_dict.get("sample", 0)
+    trainer.accelerator.print(print_ribbon(f"Loading model from {type(tracker.loader).__name__}"))
+    state_dict = tracker.recall()
+    trainer.load_state_dict(state_dict, only_model=False, strict=True)
+    return state_dict.get("epoch", 0), state_dict.get("validation_losses", []), state_dict.get("next_task", "train"), state_dict.get("sample", 0), state_dict.get("samples_seen", 0)

 def train(
    dataloaders,
-    decoder,
-    accelerator,
-    tracker,
+    decoder: Decoder,
+    accelerator: Accelerator,
+    tracker: Tracker,
    inference_device,
-    load_config=None,
    evaluate_config=None,
    epoch_samples = None,  # If the training dataset is resampling, we have to manually stop an epoch
    validation_samples = None,
+    save_immediately=False,
    epochs = 20,
    n_sample_images = 5,
    save_every_n_samples = 100000,
-    save_all=False,
-    save_latest=True,
-    save_best=True,
    unet_training_mask=None,
+    condition_on_text_encodings=False,
+    cond_scale=1.0,
    **kwargs
 ):
    """
@@ -257,9 +282,25 @@ def train(
    """
    is_master = accelerator.process_index == 0

+    if not exists(unet_training_mask):
+        # Then the unet mask should be true for all unets in the decoder
+        unet_training_mask = [True] * len(decoder.unets)
+    assert len(unet_training_mask) == len(decoder.unets), f"The unet training mask should be the same length as the number of unets in the decoder. Got {len(unet_training_mask)} and {len(decoder.unets)}"
+    trainable_unet_numbers = [i+1 for i, trainable in enumerate(unet_training_mask) if trainable]
+    first_trainable_unet = trainable_unet_numbers[0]
+    last_trainable_unet = trainable_unet_numbers[-1]
+    def move_unets(unet_training_mask):
+        for i in range(len(decoder.unets)):
+            if not unet_training_mask[i]:
+                # Replace the unet from the module list with a nn.Identity(). This training script never uses unets that aren't being trained so this is fine.
+                decoder.unets[i] = nn.Identity().to(inference_device)
+    # Remove non-trainable unets
+    move_unets(unet_training_mask)
+
    trainer = DecoderTrainer(
-        accelerator,
-        decoder,
+        decoder=decoder,
+        accelerator=accelerator,
+        dataloaders=dataloaders,
        **kwargs
    )

@@ -268,24 +309,20 @@ def train(
    validation_losses = []
    next_task = 'train'
    sample = 0
+    samples_seen = 0
    val_sample = 0
-    step = lambda: int(trainer.step.item())
+    step = lambda: int(trainer.num_steps_taken(unet_number=first_trainable_unet))

-    if exists(load_config) and exists(load_config.source):
-        start_epoch, validation_losses, next_task, recalled_sample = recall_trainer(tracker, trainer, recall_source=load_config.source, **load_config.dict())
+    if tracker.can_recall:
+        start_epoch, validation_losses, next_task, recalled_sample, samples_seen = recall_trainer(tracker, trainer)
        if next_task == 'train':
            sample = recalled_sample
        if next_task == 'val':
            val_sample = recalled_sample
-        accelerator.print(f"Loaded model from {load_config.source} on epoch {start_epoch} with minimum validation loss {min(validation_losses) if len(validation_losses) > 0 else 'N/A'}")
+        accelerator.print(f"Loaded model from {type(tracker.loader).__name__} on epoch {start_epoch} having seen {samples_seen} samples with minimum validation loss {min(validation_losses) if len(validation_losses) > 0 else 'N/A'}")
        accelerator.print(f"Starting training from task {next_task} at sample {sample} and validation sample {val_sample}")
    trainer.to(device=inference_device)

-    if not exists(unet_training_mask):
-        # Then the unet mask should be true for all unets in the decoder
-        unet_training_mask = [True] * trainer.num_unets
-    assert len(unet_training_mask) == trainer.num_unets, f"The unet training mask should be the same length as the number of unets in the decoder. Got {len(unet_training_mask)} and {trainer.num_unets}"
-
    accelerator.print(print_ribbon("Generating Example Data", repeat=40))
    accelerator.print("This can take a while to load the shard lists...")
    if is_master:
@@ -306,13 +343,22 @@ def train(
        last_snapshot = sample

        if next_task == 'train':
-            for i, (img, emb) in enumerate(dataloaders["train"]):
+            for i, (img, emb, txt) in enumerate(dataloaders["train"]):
                # We want to count the total number of samples across all processes
                sample_length_tensor[0] = len(img)
                all_samples = accelerator.gather(sample_length_tensor)  # TODO: accelerator.reduce is broken when this was written. If it is fixed replace this.
                total_samples = all_samples.sum().item()
                sample += total_samples
-                img, emb = send_to_device((img, emb))
+                samples_seen += total_samples
+                img_emb = emb.get('img')
+                has_img_embedding = img_emb is not None
+                if has_img_embedding:
+                    img_emb, = send_to_device((img_emb,))
+                text_emb = emb.get('text')
+                has_text_embedding = text_emb is not None
+                if has_text_embedding:
+                    text_emb, = send_to_device((text_emb,))
+                img, = send_to_device((img,))

                trainer.train()
                for unet in range(1, trainer.num_unets+1):
@@ -320,7 +366,21 @@ def train(
                    if not unet_training_mask[unet-1]: # Unet index is the unet number - 1
                        continue

-                    loss = trainer.forward(img, image_embed=emb, unet_number=unet)
+                    forward_params = {}
+                    if has_img_embedding:
+                        forward_params['image_embed'] = img_emb
+                    else:
+                        # Forward pass automatically generates embedding
+                        pass
+                    if condition_on_text_encodings:
+                        if has_text_embedding:
+                            forward_params['text_encodings'] = text_emb
+                        else:
+                            # Then we need to pass the text instead
+                            tokenized_texts = tokenize(txt, truncate=True)
+                            assert tokenized_texts.shape[0] == len(img), f"The number of texts ({tokenized_texts.shape[0]}) should be the same as the number of images ({len(img)})"
+                            forward_params['text'] = tokenized_texts
+                    loss = trainer.forward(img, **forward_params, unet_number=unet, _device=inference_device)
                    trainer.update(unet_number=unet)
                    unet_losses_tensor[i % TRAIN_CALC_LOSS_EVERY_ITERS, unet-1] = loss
                
@@ -333,32 +393,33 @@ def train(
                    unet_all_losses = accelerator.gather(unet_losses_tensor)
                    mask = unet_all_losses != 0
                    unet_average_loss = (unet_all_losses * mask).sum(dim=0) / mask.sum(dim=0)
-                    loss_map = { f"Unet {index} Training Loss": loss.item() for index, loss in enumerate(unet_average_loss) if loss != 0 }
+                    loss_map = { f"Unet {index} Training Loss": loss.item() for index, loss in enumerate(unet_average_loss) if unet_training_mask[index] }
+
+                    # gather decay rate on each UNet
+                    ema_decay_list = {f"Unet {index} EMA Decay": ema_unet.get_current_decay() for index, ema_unet in enumerate(trainer.ema_unets) if unet_training_mask[index]}
+
                    log_data = {
                        "Epoch": epoch,
                        "Sample": sample,
                        "Step": i,
                        "Samples per second": samples_per_sec,
+                        "Samples Seen": samples_seen,
+                        **ema_decay_list,
                        **loss_map
                    }
-                    # print(f"I am rank {accelerator.state.process_index}. Example weight: {trainer.decoder.state_dict()['module.unets.0.init_conv.convs.0.weight'][0,0,0,0]}")
-                    if is_master:
-                        tracker.log(log_data, step=step(), verbose=True)

-                if is_master and last_snapshot + save_every_n_samples < sample:  # This will miss by some amount every time, but it's not a big deal... I hope
+                    if is_master:
+                        tracker.log(log_data, step=step())
+
+                if is_master and (last_snapshot + save_every_n_samples < sample or (save_immediately and i == 0)):  # This will miss by some amount every time, but it's not a big deal... I hope
                    # It is difficult to gather this kind of info on the accelerator, so we have to do it on the master
                    print("Saving snapshot")
                    last_snapshot = sample
                    # We need to know where the model should be saved
-                    save_paths = []
-                    if save_latest:
-                        save_paths.append("latest.pth")
-                    if save_all:
-                        save_paths.append(f"checkpoints/epoch_{epoch}_step_{step()}.pth")
-                    save_trainer(tracker, trainer, epoch, sample, next_task, validation_losses, save_paths)
+                    save_trainer(tracker, trainer, epoch, sample, next_task, validation_losses, samples_seen)
                    if exists(n_sample_images) and n_sample_images > 0:
                        trainer.eval()
-                        train_images, train_captions = generate_grid_samples(trainer, train_example_data, "Train: ")
+                        train_images, train_captions = generate_grid_samples(trainer, train_example_data, first_trainable_unet, last_trainable_unet, condition_on_text_encodings, cond_scale, inference_device, "Train: ")
                        tracker.log_images(train_images, captions=train_captions, image_section="Train Samples", step=step())
                
                if epoch_samples is not None and sample >= epoch_samples:
@@ -376,19 +437,41 @@ def train(
            timer = Timer()
            accelerator.wait_for_everyone()
            i = 0
-            for i, (img, emb, txt) in enumerate(dataloaders["val"]):
+            for i, (img, emb, txt) in enumerate(dataloaders['val']):  # Use the accelerate prepared loader
                val_sample_length_tensor[0] = len(img)
                all_samples = accelerator.gather(val_sample_length_tensor)
                total_samples = all_samples.sum().item()
                val_sample += total_samples
-                img, emb = send_to_device((img, emb))
+                img_emb = emb.get('img')
+                has_img_embedding = img_emb is not None
+                if has_img_embedding:
+                    img_emb, = send_to_device((img_emb,))
+                text_emb = emb.get('text')
+                has_text_embedding = text_emb is not None
+                if has_text_embedding:
+                    text_emb, = send_to_device((text_emb,))
+                img, = send_to_device((img,))

                for unet in range(1, len(decoder.unets)+1):
                    if not unet_training_mask[unet-1]: # Unet index is the unet number - 1
                        # No need to evaluate an unchanging unet
                        continue
-                    
-                    loss = trainer.forward(img.float(), image_embed=emb.float(), unet_number=unet)
+                        
+                    forward_params = {}
+                    if has_img_embedding:
+                        forward_params['image_embed'] = img_emb.float()
+                    else:
+                        # Forward pass automatically generates embedding
+                        pass
+                    if condition_on_text_encodings:
+                        if has_text_embedding:
+                            forward_params['text_encodings'] = text_emb.float()
+                        else:
+                            # Then we need to pass the text instead
+                            tokenized_texts = tokenize(txt, truncate=True)
+                            assert tokenized_texts.shape[0] == len(img), f"The number of texts ({tokenized_texts.shape[0]}) should be the same as the number of images ({len(img)})"
+                            forward_params['text'] = tokenized_texts
+                    loss = trainer.forward(img.float(), **forward_params, unet_number=unet, _device=inference_device)
                    average_val_loss_tensor[0, unet-1] += loss

                if i % VALID_CALC_LOSS_EVERY_ITERS == 0:
@@ -409,15 +492,15 @@ def train(
            if is_master:
                unet_average_val_loss = all_average_val_losses.mean(dim=0)
                val_loss_map = { f"Unet {index} Validation Loss": loss.item() for index, loss in enumerate(unet_average_val_loss) if loss != 0 }
-                tracker.log(val_loss_map, step=step(), verbose=True)
+                tracker.log(val_loss_map, step=step())
            next_task = 'eval'

        if next_task == 'eval':
            if exists(evaluate_config):
                accelerator.print(print_ribbon(f"Starting Evaluation {epoch}", repeat=40))
-                evaluation = evaluate_trainer(trainer, dataloaders["val"], inference_device, **evaluate_config.dict())
+                evaluation = evaluate_trainer(trainer, dataloaders["val"], inference_device, first_trainable_unet, last_trainable_unet, inference_device=inference_device, **evaluate_config.dict(), condition_on_text_encodings=condition_on_text_encodings, cond_scale=cond_scale)
                if is_master:
-                    tracker.log(evaluation, step=step(), verbose=True)
+                    tracker.log(evaluation, step=step())
            next_task = 'sample'
            val_sample = 0

@@ -426,28 +509,22 @@ def train(
                # Generate examples and save the model if we are the master
                # Generate sample images
                print(print_ribbon(f"Sampling Set {epoch}", repeat=40))
-                test_images, test_captions = generate_grid_samples(trainer, test_example_data, "Test: ")
-                train_images, train_captions = generate_grid_samples(trainer, train_example_data, "Train: ")
+                test_images, test_captions = generate_grid_samples(trainer, test_example_data,  first_trainable_unet, last_trainable_unet, condition_on_text_encodings, cond_scale, inference_device, "Test: ")
+                train_images, train_captions = generate_grid_samples(trainer, train_example_data,  first_trainable_unet, last_trainable_unet, condition_on_text_encodings, cond_scale, inference_device, "Train: ")
                tracker.log_images(test_images, captions=test_captions, image_section="Test Samples", step=step())
                tracker.log_images(train_images, captions=train_captions, image_section="Train Samples", step=step())

                print(print_ribbon(f"Starting Saving {epoch}", repeat=40))
-                # Get the same paths
-                save_paths = []
-                if save_latest:
-                    save_paths.append("latest.pth")
+                is_best = False
                if all_average_val_losses is not None:
-                    average_loss = all_average_val_losses.mean(dim=0).item()
-                    if save_best and (len(validation_losses) == 0 or average_loss < min(validation_losses)):
-                        save_paths.append("best.pth")
+                    average_loss = all_average_val_losses.mean(dim=0).sum() / sum(unet_training_mask)
+                    if len(validation_losses) == 0 or average_loss < min(validation_losses):
+                        is_best = True
                    validation_losses.append(average_loss)
-                save_trainer(tracker, trainer, epoch, sample, next_task, validation_losses, save_paths)
+                save_trainer(tracker, trainer, epoch, sample, next_task, validation_losses, samples_seen, is_best=is_best)
            next_task = 'train'

-def create_tracker(accelerator, config, config_path, tracker_type=None, data_path=None):
-    """
-    Creates a tracker of the specified type and initializes special features based on the full config
-    """
+def create_tracker(accelerator: Accelerator, config: TrainDecoderConfig, config_path: str, dummy: bool = False) -> Tracker:
    tracker_config = config.tracker
    accelerator_config = {
        "Distributed": accelerator.distributed_type != accelerate_dataclasses.DistributedType.NO,
@@ -455,41 +532,33 @@ def create_tracker(accelerator, config, config_path, tracker_type=None, data_pat
        "NumProcesses": accelerator.num_processes,
        "MixedPrecision": accelerator.mixed_precision
    }
-    init_config = { "config": {**config.dict(), **accelerator_config} }
-    data_path = data_path or tracker_config.data_path
-    tracker_type = tracker_type or tracker_config.tracker_type
-
-    if tracker_type == "dummy":
-        tracker = DummyTracker(data_path)
-        tracker.init(**init_config)
-    elif tracker_type == "console":
-        tracker = ConsoleTracker(data_path)
-        tracker.init(**init_config)
-    elif tracker_type == "wandb":
-        # We need to initialize the resume state here
-        load_config = config.load
-        if load_config.source == "wandb" and load_config.resume:
-            # Then we are resuming the run load_config["run_path"]
-            run_id = load_config.run_path.split("/")[-1]
-            init_config["id"] = run_id
-            init_config["resume"] = "must"
-
-        init_config["entity"] = tracker_config.wandb_entity
-        init_config["project"] = tracker_config.wandb_project
-        tracker = WandbTracker(data_path)
-        tracker.init(**init_config)
-        tracker.save_file(str(config_path.absolute()), str(config_path.parent.absolute()))
-    else:
-        raise ValueError(f"Tracker type {tracker_type} not supported by decoder trainer")
+    tracker: Tracker = tracker_config.create(config, accelerator_config, dummy_mode=dummy)
+    tracker.save_config(config_path, config_name='decoder_config.json')
+    tracker.add_save_metadata(state_dict_key='config', metadata=config.dict())
    return tracker
    
-def initialize_training(config, config_path):
+def initialize_training(config: TrainDecoderConfig, config_path):
    # Make sure if we are not loading, distributed models are initialized to the same values
    torch.manual_seed(config.seed)

    # Set up accelerator for configurable distributed training
-    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
-    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
+    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=config.train.find_unused_parameters)
+    init_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=60*60))
+    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs, init_kwargs])
+
+    if accelerator.num_processes > 1:
+        # We are using distributed training and want to immediately ensure all can connect
+        accelerator.print("Waiting for all processes to connect...")
+        accelerator.wait_for_everyone()
+        accelerator.print("All processes online and connected")
+
+    # If we are in deepspeed fp16 mode, we must ensure learned variance is off
+    if accelerator.mixed_precision == "fp16" and accelerator.distributed_type == accelerate_dataclasses.DistributedType.DEEPSPEED and config.decoder.learned_variance:
+        raise ValueError("DeepSpeed fp16 mode does not support learned variance")
+
+    if accelerator.process_index != accelerator.local_process_index and accelerator.distributed_type == accelerate_dataclasses.DistributedType.DEEPSPEED:
+        # This is an invalid configuration until we figure out how to handle this
+        raise ValueError("DeepSpeed does not support multi-node distributed training")
    
    # Set up data
    all_shards = list(range(config.data.start_shard, config.data.end_shard + 1))
@@ -512,19 +581,44 @@ def initialize_training(config, config_path):

    # Create the decoder model and print basic info
    decoder = config.decoder.create()
-    num_parameters = sum(p.numel() for p in decoder.parameters())
+    get_num_parameters = lambda model, only_training=False: sum(p.numel() for p in model.parameters() if (p.requires_grad or not only_training))

    # Create and initialize the tracker if we are the master
-    tracker = create_tracker(accelerator, config, config_path) if rank == 0 else create_tracker(accelerator, config, config_path, tracker_type="dummy")
+    tracker = create_tracker(accelerator, config, config_path, dummy = rank!=0)
+
+    has_img_embeddings = config.data.img_embeddings_url is not None
+    has_text_embeddings = config.data.text_embeddings_url is not None
+    conditioning_on_text = any([unet.cond_on_text_encodings for unet in config.decoder.unets])
+
+    has_clip_model = config.decoder.clip is not None
+    data_source_string = ""
+
+    if has_img_embeddings:
+        data_source_string += "precomputed image embeddings"
+    elif has_clip_model:
+        data_source_string += "clip image embeddings generation"
+    else:
+        raise ValueError("No image embeddings source specified")
+    if conditioning_on_text:
+        if has_text_embeddings:
+            data_source_string += " and precomputed text embeddings"
+        elif has_clip_model:
+            data_source_string += " and clip text encoding generation"
+        else:
+            raise ValueError("No text embeddings source specified")

    accelerator.print(print_ribbon("Loaded Config", repeat=40))
    accelerator.print(f"Running training with {accelerator.num_processes} processes and {accelerator.distributed_type} distributed training")
-    accelerator.print(f"Number of parameters: {num_parameters}")
+    accelerator.print(f"Training using {data_source_string}. {'conditioned on text' if conditioning_on_text else 'not conditioned on text'}")
+    accelerator.print(f"Number of parameters: {get_num_parameters(decoder)} total; {get_num_parameters(decoder, only_training=True)} training")
+    for i, unet in enumerate(decoder.unets):
+        accelerator.print(f"Unet {i} has {get_num_parameters(unet)} total; {get_num_parameters(unet, only_training=True)} training")
+
    train(dataloaders, decoder, accelerator,
        tracker=tracker,
        inference_device=accelerator.device,
-        load_config=config.load,
        evaluate_config=config.evaluate,
+        condition_on_text_encodings=conditioning_on_text,
        **config.train.dict(),
    )
    
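The data-source checks added to `initialize_training` in `train_decoder.py` can be read as a small decision function. The following is a minimal sketch of that logic pulled out on its own; the function name `describe_data_sources` is hypothetical and does not exist in the repository, and the flags are passed in directly rather than read from a config object.

```python
def describe_data_sources(
    has_img_embeddings: bool,
    has_text_embeddings: bool,
    has_clip_model: bool,
    conditioning_on_text: bool,
) -> str:
    """Hypothetical helper mirroring the checks in the diff above."""
    # image embeddings: prefer precomputed ones, fall back to CLIP
    if has_img_embeddings:
        description = "precomputed image embeddings"
    elif has_clip_model:
        description = "clip image embeddings generation"
    else:
        raise ValueError("No image embeddings source specified")

    # text embeddings are only required when a unet conditions on text
    if conditioning_on_text:
        if has_text_embeddings:
            description += " and precomputed text embeddings"
        elif has_clip_model:
            description += " and clip text encoding generation"
        else:
            raise ValueError("No text embeddings source specified")

    return description


# precomputed image embeddings, text encoded on the fly by CLIP
mixed = describe_data_sources(True, False, True, True)

# no text conditioning, so a missing text source is not an error
image_only = describe_data_sources(True, False, False, False)
```

Under these assumptions, a configuration with neither precomputed embeddings nor a CLIP model fails before any data is loaded, which is the point of running the checks before `train` is called.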
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -1,31 +1,23 @@
-# TODO: add start, num_data_points, eval_every and group to config
-# TODO: switch back to repo's wandb
-
-START = 0
-NUM_DATA_POINTS = 250e6
-EVAL_EVERY = 1000
-GROUP = "distributed"
-
-import os
 import click
-import wandb
-
 import torch
+
 from torch import nn
-from torch.utils.data import DataLoader
-
-import numpy as np
-
+from typing import List
 from accelerate import Accelerator
+from accelerate.utils import set_seed
+from torch.utils.data import DataLoader
+from embedding_reader import EmbeddingReader
+from accelerate.utils import dataclasses as accelerate_dataclasses

-from dalle2_pytorch.dataloaders import get_reader, make_splits
 from dalle2_pytorch.utils import Timer
+from dalle2_pytorch.trackers import Tracker
+from dalle2_pytorch import DiffusionPriorTrainer
+from dalle2_pytorch.dataloaders import get_reader, make_splits
 from dalle2_pytorch.train_configs import (
+    DiffusionPriorConfig,
    DiffusionPriorTrainConfig,
    TrainDiffusionPriorConfig,
 )
-from dalle2_pytorch.trackers import BaseTracker, WandbTracker
-from dalle2_pytorch import DiffusionPriorTrainer


 # helpers
@@ -38,8 +30,19 @@ def exists(val):
    return val is not None


+def all_between(values: list, lower_bound, upper_bound):
+    for value in values:
+        if value < lower_bound or value > upper_bound:
+            return False
+
+    return True
+
+
 def make_model(
-    prior_config, train_config, device: str = None, accelerator: Accelerator = None
+    prior_config: DiffusionPriorConfig,
+    train_config: DiffusionPriorTrainConfig,
+    device: str = None,
+    accelerator: Accelerator = None,
 ):
    # create model from config
    diffusion_prior = prior_config.create()
@@ -54,71 +57,214 @@ def make_model(
        use_ema=train_config.use_ema,
        device=device,
        accelerator=accelerator,
+        warmup_steps=train_config.warmup_steps,
    )

    return trainer


+def create_tracker(
+    accelerator: Accelerator,
+    config: TrainDiffusionPriorConfig,
+    config_path: str,
+    dummy: bool = False,
+) -> Tracker:
+    tracker_config = config.tracker
+
+    accelerator_config = {
+        "Distributed": accelerator.distributed_type
+        != accelerate_dataclasses.DistributedType.NO,
+        "DistributedType": accelerator.distributed_type,
+        "NumProcesses": accelerator.num_processes,
+        "MixedPrecision": accelerator.mixed_precision,
+    }
+
+    tracker: Tracker = tracker_config.create(
+        config, accelerator_config, dummy_mode=dummy
+    )
+
+    tracker.save_config(config_path, config_name="prior_config.json")
+
+    return tracker
+
+
+def pad_gather_reduce(trainer: DiffusionPriorTrainer, x, method="mean"):
+    """
+    pad a value or tensor across all processes and gather
+
+    params:
+        - trainer: a trainer that carries an accelerator object
+        - x: a number or torch tensor to reduce
+        - method: "mean", "sum", "max", "min"
+
+    return:
+        - the reduced tensor (per `method`) after masking out zeros
+        - None if the gather resulted in an empty tensor
+    """
+
+    assert method in [
+        "mean",
+        "sum",
+        "max",
+        "min",
+    ], "This function has limited capabilities [sum, mean, max, min]"
+    assert x is not None, "Cannot reduce a None type object"
+
+    # convert plain numbers to a tensor so they can be gathered
+
+    if type(x) is not torch.Tensor:
+        x = torch.tensor([x])
+
+    # verify that the tensor is on the proper device
+    x = x.to(trainer.device)
+
+    # pad across processes
+    padded_x = trainer.accelerator.pad_across_processes(x, dim=0)
+
+    # gather across all processes
+    gathered_x = trainer.accelerator.gather(padded_x)
+
+    # mask out zeros
+    masked_x = gathered_x[gathered_x != 0]
+
+    # if the tensor is empty, warn and return None
+    if len(masked_x) == 0:
+        click.secho(
+            f"The call to this method resulted in an empty tensor after masking out zeros. The gathered tensor was this: {gathered_x} and the original value passed was: {x}.",
+            fg="red",
+        )
+        return None
+
+    if method == "mean":
+        return torch.mean(masked_x)
+    elif method == "sum":
+        return torch.sum(masked_x)
+    elif method == "max":
+        return torch.max(masked_x)
+    elif method == "min":
+        return torch.min(masked_x)
+
+
+def save_trainer(
+    tracker: Tracker,
+    trainer: DiffusionPriorTrainer,
+    is_latest: bool,
+    is_best: bool,
+    epoch: int,
+    samples_seen: int,
+    best_validation_loss: float,
+):
+    """
+    Logs the model with an appropriate method depending on the tracker
+    """
+    trainer.accelerator.wait_for_everyone()
+
+    if trainer.accelerator.is_main_process:
+        click.secho(
+            f"RANK:{trainer.accelerator.process_index} | Saving Model | Best={is_best} | Latest={is_latest}",
+            fg="magenta",
+        )
+
+    tracker.save(
+        trainer=trainer,
+        is_best=is_best,
+        is_latest=is_latest,
+        epoch=int(epoch),
+        samples_seen=int(samples_seen),
+        best_validation_loss=best_validation_loss,
+    )
+
+
+def recall_trainer(tracker: Tracker, trainer: DiffusionPriorTrainer):
+    """
+    Loads the model with an appropriate method depending on the tracker
+    """
+
+    if trainer.accelerator.is_main_process:
+        click.secho(f"Loading model from {type(tracker.loader).__name__}", fg="yellow")
+
+    state_dict = tracker.recall()
+
+    trainer.load(state_dict, strict=True)
+
+    return (
+        int(state_dict.get("epoch", 0)),
+        state_dict.get("best_validation_loss", 0),
+        int(state_dict.get("samples_seen", 0)),
+    )
+
+
 # eval functions


-def eval_model(
+def report_validation_loss(
    trainer: DiffusionPriorTrainer,
    dataloader: DataLoader,
    text_conditioned: bool,
+    use_ema: bool,
+    tracker: Tracker,
+    split: str,
+    tracker_folder: str,
    loss_type: str,
-    tracker_context: str,
-    tracker: BaseTracker = None,
-    use_ema: bool = True,
 ):
-    trainer.eval()
-    if trainer.is_main_process():
-        click.secho(f"Measuring performance on {tracker_context}", fg="green", blink=True)
+    """
+    Compute the validation loss on a given subset of data.
+    """

-    with torch.no_grad():
-        total_loss = 0.0
-        total_samples = 0.0
+    if trainer.accelerator.is_main_process:
+        click.secho(
+            f"Measuring performance on {tracker_folder}",
+            fg="green",
+            blink=True,
+        )

-        for image_embeddings, text_data in dataloader:
-            image_embeddings = image_embeddings.to(trainer.device)
-            text_data = text_data.to(trainer.device)
+    total_loss = torch.zeros(1, dtype=torch.float, device=trainer.device)

-            batches = image_embeddings.shape[0]
+    for image_embeddings, text_data in dataloader:
+        image_embeddings = image_embeddings.to(trainer.device)
+        text_data = text_data.to(trainer.device)

-            input_args = dict(image_embed=image_embeddings)
+        input_args = dict(image_embed=image_embeddings)

-            if text_conditioned:
-                input_args = dict(**input_args, text=text_data)
-            else:
-                input_args = dict(**input_args, text_embed=text_data)
+        if text_conditioned:
+            input_args = dict(**input_args, text=text_data)
+        else:
+            input_args = dict(**input_args, text_embed=text_data)

-            if use_ema:
-                loss = trainer.ema_diffusion_prior(**input_args)
-            else:
-                loss = trainer(**input_args)
+        if use_ema:
+            loss = trainer.ema_diffusion_prior(**input_args)
+        else:
+            loss = trainer(**input_args)

-            total_loss += loss * batches
-            total_samples += batches
+        total_loss += loss

-        avg_loss = total_loss / total_samples
+    # compute the average loss across all processes

-        stats = {f"{tracker_context}-{loss_type}": avg_loss}
-        trainer.print(stats)
+    avg_loss = pad_gather_reduce(trainer, total_loss, method="mean")
+    stats = {f"{tracker_folder}/{loss_type}-loss": avg_loss}

-        if exists(tracker):
-            tracker.log(stats, step=trainer.step.item() + 1)
+    # print and log results on main process
+    tracker.log(stats, step=trainer.step.item() + 1)
+
+    return avg_loss


 def report_cosine_sims(
    trainer: DiffusionPriorTrainer,
    dataloader: DataLoader,
    text_conditioned: bool,
-    tracker: BaseTracker,
-    tracker_context: str = "validation",
+    tracker: Tracker,
+    split: str,
+    timesteps: int,
+    tracker_folder: str,
 ):
    trainer.eval()
-    if trainer.is_main_process():
-        click.secho("Measuring Cosine-Similarity", fg="green", blink=True)
+    if trainer.accelerator.is_main_process:
+        click.secho(
+            f"Measuring Cosine-Similarity on {split} split with {timesteps} timesteps",
+            fg="green",
+            blink=True,
+        )

    for test_image_embeddings, text_data in dataloader:
        test_image_embeddings = test_image_embeddings.to(trainer.device)
@@ -126,10 +272,8 @@ def report_cosine_sims(

        # we are text conditioned, we produce an embedding from the tokenized text
        if text_conditioned:
-            text_embedding, text_encodings, text_mask = trainer.embed_text(text_data)
-            text_cond = dict(
-                text_embed=text_embedding, text_encodings=text_encodings, mask=text_mask
-            )
+            text_embedding, text_encodings = trainer.embed_text(text_data)
+            text_cond = dict(text_embed=text_embedding, text_encodings=text_encodings)
        else:
            text_embedding = text_data
            text_cond = dict(text_embed=text_embedding)
@@ -146,15 +290,11 @@ def report_cosine_sims(

        if text_conditioned:
            text_encodings_shuffled = text_encodings[rolled_idx]
-            text_mask_shuffled = text_mask[rolled_idx]
        else:
            text_encodings_shuffled = None
-            text_mask_shuffled = None

        text_cond_shuffled = dict(
-            text_embed=text_embed_shuffled,
-            text_encodings=text_encodings_shuffled,
-            mask=text_mask_shuffled,
+            text_embed=text_embed_shuffled, text_encodings=text_encodings_shuffled
        )

        # prepare the text embedding
@@ -167,7 +307,9 @@ def report_cosine_sims(

        # predict on the unshuffled text embeddings
        predicted_image_embeddings = trainer.p_sample_loop(
-            test_image_embeddings.shape, text_cond
+            test_image_embeddings.shape,
+            text_cond,
+            timesteps=timesteps,
        )

        predicted_image_embeddings = (
@@ -177,7 +319,9 @@ def report_cosine_sims(

        # predict on the shuffled embeddings
        predicted_unrelated_embeddings = trainer.p_sample_loop(
-            test_image_embeddings.shape, text_cond_shuffled
+            test_image_embeddings.shape,
+            text_cond_shuffled,
+            timesteps=timesteps,
        )

        predicted_unrelated_embeddings = (
@@ -186,32 +330,97 @@ def report_cosine_sims(
        )

        # calculate similarities
-        original_similarity = cos(text_embed, test_image_embeddings).cpu().numpy()
-        predicted_similarity = cos(text_embed, predicted_image_embeddings).cpu().numpy()
-        unrelated_similarity = (
-            cos(text_embed, predicted_unrelated_embeddings).cpu().numpy()
+        orig_sim = pad_gather_reduce(
+            trainer, cos(text_embed, test_image_embeddings), method="mean"
        )
-        predicted_img_similarity = (
-            cos(test_image_embeddings, predicted_image_embeddings).cpu().numpy()
+        pred_sim = pad_gather_reduce(
+            trainer, cos(text_embed, predicted_image_embeddings), method="mean"
+        )
+        unrel_sim = pad_gather_reduce(
+            trainer, cos(text_embed, predicted_unrelated_embeddings), method="mean"
+        )
+        pred_img_sim = pad_gather_reduce(
+            trainer,
+            cos(test_image_embeddings, predicted_image_embeddings),
+            method="mean",
        )

        stats = {
-            f"{tracker_context}/baseline similarity": np.mean(original_similarity),
-            f"{tracker_context}/similarity with text": np.mean(predicted_similarity),
-            f"{tracker_context}/similarity with original image": np.mean(
-                predicted_img_similarity
-            ),
-            f"{tracker_context}/similarity with unrelated caption": np.mean(unrelated_similarity),
-            f"{tracker_context}/difference from baseline similarity": np.mean(
-                predicted_similarity - original_similarity
-            ),
+            f"{tracker_folder}/baseline similarity [steps={timesteps}]": orig_sim,
+            f"{tracker_folder}/similarity with text [steps={timesteps}]": pred_sim,
+            f"{tracker_folder}/similarity with original image [steps={timesteps}]": pred_img_sim,
+            f"{tracker_folder}/similarity with unrelated caption [steps={timesteps}]": unrel_sim,
+            f"{tracker_folder}/difference from baseline similarity [steps={timesteps}]": pred_sim
+            - orig_sim,
        }

-        for k, v in stats.items():
-            trainer.print(f"{tracker_context}/{k}: {v}")
+        tracker.log(stats, step=trainer.step.item() + 1)

-        if exists(tracker):
-            tracker.log(stats, step=trainer.step.item() + 1)
+
+def eval_model(
+    trainer: DiffusionPriorTrainer,
+    dataloader: DataLoader,
+    text_conditioned: bool,
+    split: str,
+    tracker: Tracker,
+    use_ema: bool,
+    report_cosine: bool,
+    report_loss: bool,
+    timesteps: List[int],
+    loss_type: str = None,
+):
+    """
+    Run evaluation on a model and track metrics
+
+    returns: loss if requested
+    """
+    trainer.eval()
+
+    # keep use_ema as a bool; a non-empty label string would always be truthy downstream
+    model_label = "ema" if use_ema else "online"
+    tracker_folder = f"metrics/{model_label}-{split}"
+
+    # determine if valid timesteps are passed
+
+    min_timesteps = trainer.accelerator.unwrap_model(
+        trainer.diffusion_prior
+    ).sample_timesteps
+    max_timesteps = trainer.accelerator.unwrap_model(
+        trainer.diffusion_prior
+    ).noise_scheduler.num_timesteps
+
+    assert all_between(
+        timesteps, lower_bound=min_timesteps, upper_bound=max_timesteps
+    ), f"all timesteps values must be between {min_timesteps} and {max_timesteps}: got {timesteps}"
+
+    # measure cosine metrics for each requested number of sampling timesteps
+
+    if report_cosine:
+        for timestep in timesteps:
+            report_cosine_sims(
+                trainer,
+                dataloader=dataloader,
+                text_conditioned=text_conditioned,
+                tracker=tracker,
+                split=split,
+                timesteps=timestep,
+                tracker_folder=tracker_folder,
+            )
+
+    # measure loss on a separate split of data
+
+    if report_loss:
+        loss = report_validation_loss(
+            trainer=trainer,
+            dataloader=dataloader,
+            text_conditioned=text_conditioned,
+            use_ema=use_ema,
+            tracker=tracker,
+            split=split,
+            tracker_folder=tracker_folder,
+            loss_type=loss_type,
+        )
+
+        return loss


 # training script
@@ -219,182 +428,327 @@ def report_cosine_sims(

 def train(
    trainer: DiffusionPriorTrainer,
+    tracker: Tracker,
    train_loader: DataLoader,
    eval_loader: DataLoader,
    test_loader: DataLoader,
    config: DiffusionPriorTrainConfig,
 ):
-    # distributed tracking with wandb
-    if trainer.accelerator.num_processes > 1:
-        os.environ["WANDB_START_METHOD"] = "thread"
+    # init timers
+    save_timer = Timer()  # when to save
+    samples_timer = Timer()  # samples/sec
+    validation_profiler = Timer()  # how long is validation taking
+    validation_countdown = Timer()  # when to perform evaluation

-    tracker = wandb.init(
-        name=f"RANK:{trainer.device}",
-        entity=config.tracker.wandb_entity,
-        project=config.tracker.wandb_project,
-        config=config.dict(),
-        group=GROUP,
-    )
+    # keep track of best validation loss

-    # sync after tracker init
-    trainer.wait_for_everyone()
-
-    # init a timer
-    timer = Timer()
+    best_validation_loss = config.train.best_validation_loss
+    samples_seen = config.train.num_samples_seen

    # do training
-    for img, txt in train_loader:
-        trainer.train()
-        current_step = trainer.step.item() + 1

-        # place data on device
-        img = img.to(trainer.device)
-        txt = txt.to(trainer.device)
+    start_epoch = config.train.current_epoch

-        # pass to model
-        loss = trainer(text=txt, image_embed=img)
+    for epoch in range(start_epoch, config.train.epochs):
+        # if we finished out an old epoch, reset the distribution to be a full epoch
+        tracker.log({"tracking/epoch": epoch}, step=trainer.step.item())

-        # display & log loss (will only print from main process)
-        trainer.print(f"Step {current_step}: Loss {loss}")
+        if train_loader.dataset.get_start() > 0 and epoch == start_epoch+1:
+            if trainer.accelerator.is_main_process:
+                click.secho(f"Finished resumed epoch...resetting dataloader.")
+            train_loader.dataset.set_start(0)

-        # perform backprop & apply EMA updates
-        trainer.update()
+        for img, txt in train_loader:
+            # setup things every step

-        # track samples/sec/rank
-        samples_per_sec = img.shape[0] / timer.elapsed()
+            trainer.train()
+            current_step = trainer.step.item()
+            samples_timer.reset()

-        # samples seen
-        samples_seen = (
-            config.data.batch_size * trainer.accelerator.num_processes * current_step
-        )
+            # place data on device

-        # ema decay
-        ema_decay = trainer.ema_diffusion_prior.get_current_decay()
+            img = img.to(trainer.device)
+            txt = txt.to(trainer.device)

-        # Log on all processes for debugging
-        tracker.log(
-            {
-                "tracking/samples-sec": samples_per_sec,
-                "tracking/samples-seen": samples_seen,
-                "tracking/ema-decay": ema_decay,
-                "metrics/training-loss": loss,
-            },
-            step=current_step,
-        )
+            # pass to model

-        # Metric Tracking & Checkpointing (outside of timer's scope)
-        if current_step % EVAL_EVERY == 0:
-            eval_model(
-                trainer=trainer,
-                dataloader=eval_loader,
-                text_conditioned=config.prior.condition_on_text_encodings,
-                loss_type=config.prior.loss_type,
-                tracker_context="metrics/online-model-validation",
-                tracker=tracker,
-                use_ema=False,
+            loss = trainer(text=txt, image_embed=img)
+
+            # perform backprop & apply EMA updates
+
+            trainer.update()
+
+            # gather info about training step
+
+            all_loss = pad_gather_reduce(trainer, loss, method="mean")
+            num_samples = pad_gather_reduce(trainer, len(txt), method="sum")
+            samples_per_sec = num_samples / samples_timer.elapsed()
+            samples_seen += num_samples
+            ema_decay = trainer.ema_diffusion_prior.get_current_decay()
+
+            # log
+
+            tracker.log(
+                {
+                    "tracking/samples-sec": samples_per_sec,
+                    "tracking/samples-seen": samples_seen,
+                    "tracking/ema-decay": ema_decay,
+                    f"tracking/training-{config.prior.loss_type}": all_loss,
+                },
+                step=current_step,
            )

-            eval_model(
-                trainer=trainer,
-                dataloader=eval_loader,
-                text_conditioned=config.prior.condition_on_text_encodings,
-                loss_type=config.prior.loss_type,
-                tracker_context="metrics/ema-model-validation",
-                tracker=tracker,
-                use_ema=True,
+            # Metric Tracking @ Timed Intervals
+
+            eval_delta = pad_gather_reduce(
+                trainer, validation_countdown.elapsed(), method="min"
            )

-            report_cosine_sims(
-                trainer=trainer,
-                dataloader=eval_loader,
-                text_conditioned=config.prior.condition_on_text_encodings,
-                tracker=tracker,
-                tracker_context="metrics",
-            )
+            if eval_delta is not None and eval_delta > config.data.eval_every_seconds:
+                # begin timing how long this takes

-        if current_step % config.train.save_every == 0:
-            trainer.save(f"{config.tracker.data_path}/chkpt_step_{current_step}.pth")
+                validation_profiler.reset()

-        # reset timer for next round
-        timer.reset()
+                # package kwargs for evaluation
+
+                eval_kwargs = {
+                    "trainer": trainer,
+                    "tracker": tracker,
+                    "text_conditioned": config.prior.condition_on_text_encodings,
+                    "timesteps": config.train.eval_timesteps,
+                }
+
+                # ONLINE MODEL : COSINE : LOSS : VALIDATION SPLIT
+
+                eval_model(
+                    dataloader=eval_loader,
+                    loss_type=config.prior.loss_type,
+                    split="validation",
+                    use_ema=False,
+                    report_cosine=False,
+                    report_loss=True,
+                    **eval_kwargs,
+                )
+
+                # EMA MODEL : COSINE : LOSS : VALIDATION DATA
+
+                ema_val_loss = eval_model(
+                    dataloader=eval_loader,
+                    loss_type=config.prior.loss_type,
+                    split="validation",
+                    use_ema=True,
+                    report_cosine=True,
+                    report_loss=True,
+                    **eval_kwargs,
+                )
+
+                tracker.log(
+                    {
+                        "tracking/validation length (minutes)": validation_profiler.elapsed()
+                        / 60
+                    }
+                )
+
+                # check if the ema validation is the lowest seen yet
+
+                if ema_val_loss < best_validation_loss:
+                    best_validation_loss = ema_val_loss
+
+                    #  go save the model as best
+
+                    save_trainer(
+                        trainer=trainer,
+                        tracker=tracker,
+                        is_best=True,
+                        is_latest=False,
+                        samples_seen=samples_seen,
+                        epoch=epoch,
+                        best_validation_loss=best_validation_loss,
+                    )
+
+                # reset timer for validation
+
+                validation_countdown.reset()
+
+            elif eval_delta is None:
+                click.secho(
+                    f"Error occurred reading the eval time on rank: {trainer.device}",
+                    fg="yellow",
+                )
+
+            # save as latest model on schedule
+
+            save_delta = pad_gather_reduce(trainer, save_timer.elapsed(), method="min")
+
+            if save_delta is not None and save_delta >= config.train.save_every_seconds:
+                save_trainer(
+                    trainer=trainer,
+                    tracker=tracker,
+                    is_best=False,
+                    is_latest=True,
+                    samples_seen=samples_seen,
+                    epoch=epoch,
+                    best_validation_loss=best_validation_loss,
+                )
+
+                save_timer.reset()
+
+            elif save_delta is None:
+                click.secho(
+                    f"Error occurred reading the save time on rank: {trainer.device}",
+                    fg="yellow",
+                )

    # evaluate on test data

-    eval_model(
+    if trainer.accelerator.is_main_process:
+        click.secho(f"Starting Test", fg="red")
+
+    # save one last time as latest before beginning the test evaluation
+
+    save_trainer(
+        tracker=tracker,
+        trainer=trainer,
+        is_best=False,
+        is_latest=True,
+        samples_seen=samples_seen,
+        epoch=epoch,
+        best_validation_loss=best_validation_loss,
+    )
+
+    test_loss = eval_model(
        trainer=trainer,
        dataloader=test_loader,
        text_conditioned=config.prior.condition_on_text_encodings,
-        loss_type=config.prior.loss_type,
-        tracker_context="test",
+        split="test",
        tracker=tracker,
+        use_ema=True,
+        report_cosine=False,
+        report_loss=True,
+        timesteps=config.train.eval_timesteps,
+        loss_type=config.prior.loss_type,
    )

-    report_cosine_sims(
-        trainer,
-        test_loader,
-        config.prior.condition_on_text_encodings,
-        tracker,
-        tracker_context="test",
-    )
+    if test_loss < best_validation_loss:
+        best_validation_loss = test_loss
+
+        #  go save the model as best
+
+        save_trainer(
+            trainer=trainer,
+            tracker=tracker,
+            is_best=True,
+            is_latest=False,
+            samples_seen=samples_seen,
+            epoch=epoch,
+            best_validation_loss=test_loss,
+        )


-def initialize_training(config, accelerator=None):
+def initialize_training(config_file, accelerator):
    """
    Parse the configuration file, and prepare everything necessary for training
    """
+    # load the configuration file
+    if accelerator.is_main_process:
+        click.secho(f"Loading configuration from {config_file}", fg="green")
+
+    config = TrainDiffusionPriorConfig.from_json_path(config_file)
+
+    # seed
+
+    set_seed(config.train.random_seed)

    # get a device

-    if accelerator:
-        device = accelerator.device
-        click.secho(f"Accelerating on: {device}", fg="yellow")
-    else:
-        if torch.cuda.is_available():
-            click.secho("GPU detected, defaulting to cuda:0", fg="yellow")
-            device = "cuda:0"
-        else:
-            click.secho("No GPU detected...using cpu", fg="yellow")
-            device = "cpu"
+    device = accelerator.device

    # make the trainer (will automatically distribute if possible & configured)

-    trainer = make_model(config.prior, config.train, device, accelerator).to(device)
+    trainer: DiffusionPriorTrainer = make_model(
+        config.prior, config.train, device, accelerator
+    ).to(device)
+
+    # create a tracker
+
+    tracker = create_tracker(
+        accelerator, config, config_file, dummy=accelerator.process_index != 0
+    )

    # reload from chcekpoint

-    if config.load.resume == True:
-        click.secho(f"Loading checkpoint: {config.load.source}", fg="cyan")
-        trainer.load(config.load.source)
+    if tracker.can_recall:
+        current_epoch, best_validation_loss, samples_seen = recall_trainer(
+            tracker=tracker, trainer=trainer
+        )
+
+        # display best values
+        if trainer.accelerator.is_main_process:
+            click.secho(f"Current Epoch: {current_epoch} | Best Val Loss: {best_validation_loss} | Samples Seen: {samples_seen}", fg="yellow")
+
+        # update config to reflect recalled values
+        config.train.num_samples_seen = samples_seen
+        config.train.current_epoch = current_epoch
+        config.train.best_validation_loss = best_validation_loss

    # fetch and prepare data

-    if trainer.is_main_process():
-        click.secho("Grabbing data from source", fg="blue", blink=True)
+    if trainer.accelerator.is_main_process:
+        click.secho("Grabbing data...", fg="blue", blink=True)

+    trainer.accelerator.wait_for_everyone()
    img_reader = get_reader(
        text_conditioned=trainer.text_conditioned,
        img_url=config.data.image_url,
        meta_url=config.data.meta_url,
    )

+    # calculate start point within epoch
+
+    trainer.accelerator.wait_for_everyone()
+
    train_loader, eval_loader, test_loader = make_splits(
        text_conditioned=trainer.text_conditioned,
        batch_size=config.data.batch_size,
-        num_data_points=NUM_DATA_POINTS,
+        num_data_points=config.data.num_data_points,
        train_split=config.data.splits.train,
        eval_split=config.data.splits.val,
        image_reader=img_reader,
-        rank=accelerator.state.process_index if exists(accelerator) else 0,
-        world_size=accelerator.state.num_processes if exists(accelerator) else 1,
-        start=START,
+        rank=accelerator.state.process_index,
+        world_size=accelerator.state.num_processes,
+        start=0,
    )

-    # wait for everyone to load data before continuing
-    trainer.wait_for_everyone()
+    # update the start point to finish out the epoch on a resumed run
+
+    if tracker.can_recall:
+        samples_seen = config.train.num_samples_seen
+        length = (
+            config.data.num_data_points
+            if samples_seen <= img_reader.count
+            else img_reader.count
+        )
+        scaled_samples = length * config.train.current_epoch
+        start_point = (
+            scaled_samples - samples_seen if scaled_samples > samples_seen else samples_seen
+        )
+
+        if trainer.accelerator.is_main_process:
+            click.secho(f"Resuming at sample: {start_point}", fg="yellow")
+
+        train_loader.dataset.set_start(start_point)

    # start training
+
+    if trainer.accelerator.is_main_process:
+        click.secho(
+            f"Beginning Prior Training : Distributed={accelerator.state.distributed_type != accelerate_dataclasses.DistributedType.NO}",
+            fg="yellow",
+        )
+
    train(
        trainer=trainer,
+        tracker=tracker,
        train_loader=train_loader,
        eval_loader=eval_loader,
        test_loader=test_loader,
@@ -403,23 +757,13 @@ def initialize_training(config, accelerator=None):


@click.command()
-@click.option("--hfa", default=True)
-@click.option("--config_path", default="configs/prior.json")
-def main(hfa, config_path):
-    # start HFA if requested
-    if hfa:
-        accelerator = Accelerator()
-    else:
-        accelerator = None
+@click.option("--config_file", default="configs/train_prior_config.example.json")
+def main(config_file):
+    # start HFA
+    accelerator = Accelerator()

-    # load the configuration file on main process
-    if not exists(accelerator) or accelerator.is_main_process:
-        click.secho(f"Loading configuration from {config_path}", fg="green")
-
-    config = TrainDiffusionPriorConfig.from_json_path(config_path)
-
-    # send config to get processed
-    initialize_training(config, accelerator)
+    # setup training
+    initialize_training(config_file, accelerator)


 if __name__ == "__main__":
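
The resume arithmetic added in the hunk above (the `if tracker.can_recall:` block) can be read in isolation as a small pure function. The following is only a sketch that mirrors those expressions for inspection: `resume_start_point` and its parameter names are hypothetical and not part of the repository, and the arguments stand in for `config.train.num_samples_seen`, `config.data.num_data_points`, `img_reader.count`, and `config.train.current_epoch`.

```python
def resume_start_point(samples_seen, num_data_points, reader_count, current_epoch):
    """Hypothetical helper mirroring the start-point arithmetic in the diff.

    Not part of the repository; it only restates the expressions from the
    `if tracker.can_recall:` block so they can be checked by hand.
    """
    # Use the configured data-point count unless more samples have already
    # been seen than the reader holds, in which case use the reader's count.
    length = num_data_points if samples_seen <= reader_count else reader_count

    # Scale that length by the epoch index recorded in the config.
    scaled_samples = length * current_epoch

    # Same conditional as the diff: take the difference when the scaled
    # count exceeds the samples seen, otherwise use samples_seen directly.
    if scaled_samples > samples_seen:
        return scaled_samples - samples_seen
    return samples_seen


if __name__ == "__main__":
    # Example values chosen purely for illustration.
    print(resume_start_point(1500, 1000, 10000, 2))
    print(resume_start_point(300, 1000, 10000, 0))
```

Keeping the computation separate from the trainer in this way would make it possible to check a resumed run's offset without loading a checkpoint; whether the project structures it like this is not stated in the diff.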
Author	SHA1	Message	Date
Phil Wang	301a97197f	fix self conditioning shape in diffusion prior	2022-08-12 12:29:25 -07:00
Phil Wang	9440411954	make self conditioning technique work with diffusion prior	2022-08-12 12:20:51 -07:00
Phil Wang	981d407792	comment	2022-08-12 11:41:23 -07:00
Phil Wang	7c5477b26d	bet on the new self-conditioning technique out of geoffrey hintons group	2022-08-12 11:36:08 -07:00
Phil Wang	be3bb868bf	add gradient checkpointing for all resnet blocks	2022-08-02 19:21:44 -07:00
Phil Wang	451de34871	enforce clip anytorch version	2022-07-30 10:07:55 -07:00
Phil Wang	f22e8c8741	make open clip available for use with dalle2 pytorch	2022-07-30 09:02:31 -07:00
Phil Wang	87432e93ad	quick fix for linear attention	2022-07-29 13:17:12 -07:00
Phil Wang	d167378401	add cosine sim for self attention as well, as a setting	2022-07-29 12:48:20 -07:00
Phil Wang	2d67d5821e	change up epsilon in layernorm the case of using fp16, thanks to @Veldrovive for figuring out this stabilizes training	2022-07-29 12:41:02 -07:00
Phil Wang	748c7fe7af	allow for cosine sim cross attention, modify linear attention in attempt to resolve issue on fp16	2022-07-29 11:12:18 -07:00
Phil Wang	80046334ad	make sure entire readme runs without errors	2022-07-28 10:17:43 -07:00
Phil Wang	36fb46a95e	fix readme and a small bug in DALLE2 class	2022-07-28 08:33:51 -07:00
Phil Wang	07abfcf45b	rescale values in linear attention to mitigate overflows in fp16 setting	2022-07-27 12:27:38 -07:00
Phil Wang	2e35a9967d	product management	2022-07-26 11:10:16 -07:00
Phil Wang	406e75043f	add upsample combiner feature for the unets	2022-07-26 10:46:04 -07:00
Phil Wang	9646dfc0e6	fix path_or_state bug	2022-07-26 09:47:54 -07:00
Phil Wang	62043acb2f	fix repaint	2022-07-24 15:29:06 -07:00
Phil Wang	417ff808e6	1.0.3	2022-07-22 13:16:57 -07:00
Aidan Dempster	f3d7e226ba	Changed types to be generic instead of functions (#215) This allows pylance to do proper type hinting and makes developing extensions to the package much easier	2022-07-22 13:16:29 -07:00
Phil Wang	48a1302428	1.0.2	2022-07-20 23:01:51 -07:00
Aidan Dempster	ccaa46b81b	Re-introduced change that was accidentally rolled back (#212)	2022-07-20 23:01:19 -07:00
Phil Wang	76d08498cc	diffusion prior training updates from @nousr	2022-07-20 18:05:27 -07:00
zion	f9423d308b	Prior updates (#211) * update configs for prior add prior warmup to config update example prior config * update prior trainer & script add deepspeed amp & warmup adopt full accelerator support reload at sample point finish epoch resume code * update tracker save method for prior * helper functions for prior_loader	2022-07-20 18:04:26 -07:00
Phil Wang	06c65b60d2	1.0.0	2022-07-19 19:08:17 -07:00
Aidan Dempster	4145474bab	Improved upsampler training (#181) Sampling is now possible without the first decoder unet. Non-training unets are deleted in the decoder trainer since they are never used and it is harder to merge the models if they have keys in this state dict. Fixed a mistake where clip was not re-added after saving.	2022-07-19 19:07:50 -07:00
Phil Wang	4b912a38c6	0.26.2	2022-07-19 17:50:36 -07:00
Aidan Dempster	f97e55ec6b	Quality of life improvements for tracker savers (#210) The default save location is now none so if keys are not specified the corresponding checkpoint type is not saved. Models and checkpoints are now both saved with version number and the config used to create them in order to simplify loading. Documentation was fixed to be in line with current usage.	2022-07-19 17:50:18 -07:00
Phil Wang	291377bb9c	@jacobwjs reports dynamic thresholding works very well and 0.95 is a better value	2022-07-19 11:31:56 -07:00
Phil Wang	7f120a8b56	cleanup, CLI no longer necessary since Zion + Aidan have https://github.com/LAION-AI/dalle2-laion and colab notebook going	2022-07-19 09:47:44 -07:00
Phil Wang	8c003ab1e1	readme and citation	2022-07-19 09:36:45 -07:00
Phil Wang	723bf0abba	complete inpainting ability using inpaint_image and inpaint_mask passed into sample function for decoder	2022-07-19 09:26:55 -07:00
Phil Wang	d88c7ba56c	fix a bug with ddim and predict x0 objective	2022-07-18 19:04:26 -07:00
Phil Wang	3676a8ce78	comments	2022-07-18 15:02:04 -07:00
Phil Wang	da8e99ada0	fix sample bug	2022-07-18 13:50:22 -07:00
Phil Wang	6afb886cf4	complete imagen-like noise level conditioning	2022-07-18 13:43:57 -07:00
Phil Wang	c7fe4f2f44	project management	2022-07-17 17:27:44 -07:00
Phil Wang	a2ee3fa3cc	offer way to turn off initial cross embed convolutional module, for debugging upsampler artifacts	2022-07-15 17:29:10 -07:00
Phil Wang	a58a370d75	takes care of a grad strides error at https://github.com/lucidrains/DALLE2-pytorch/issues/196 thanks to @YUHANG-Ma	2022-07-14 15:28:34 -07:00
Phil Wang	1662bbf226	protect against random cropping for base unet	2022-07-14 12:49:43 -07:00
Phil Wang	5be1f57448	update	2022-07-14 12:03:42 -07:00
Phil Wang	c52ce58e10	update	2022-07-14 10:54:51 -07:00
Phil Wang	a34f60962a	let the neural network peek at the low resolution conditioning one last time before making prediction, for upsamplers	2022-07-14 10:27:04 -07:00
Phil Wang	0b40cbaa54	just always use nearest neighbor interpolation when resizing for low resolution conditioning, for https://github.com/lucidrains/DALLE2-pytorch/pull/181	2022-07-13 20:59:43 -07:00
Phil Wang	f141144a6d	allow for using classifier free guidance for some unets but not others, by passing in a tuple of cond_scale during sampling for decoder, just in case it is causing issues for upsamplers	2022-07-13 13:12:30 -07:00
Phil Wang	f988207718	hack around some inplace error, also make sure for openai clip text encoding, only tokens after eos_id is masked out	2022-07-13 12:56:02 -07:00
Phil Wang	b2073219f0	foolproof sampling for decoder to always use eval mode (and restore training state afterwards)	2022-07-13 10:21:00 -07:00
Phil Wang	cc0f7a935c	fix non pixel shuffle upsample	2022-07-13 10:16:02 -07:00
Phil Wang	95a512cb65	fix a potential bug with conditioning with blurred low resolution image, blur should be applied only 50% of the time	2022-07-13 10:11:49 -07:00
Phil Wang	972ee973bc	fix issue with ddim and normalization of lowres conditioning image	2022-07-13 09:48:40 -07:00
Phil Wang	79e2a3bc77	only use the stable layernorm for final output norm in transformer	2022-07-13 07:56:30 -07:00
Aidan Dempster	544cdd0b29	Reverted to using basic dataloaders (#205) Accelerate removes the ability to collate strings. Likely since it cannot gather strings.	2022-07-12 18:22:27 -07:00
Phil Wang	349aaca56f	add yet another transformer stability measure	2022-07-12 17:49:16 -07:00
Phil Wang	3ee3c56d2a	add learned padding tokens, same strategy as dalle1, for diffusion prior, and get rid of masking in causal transformer	2022-07-12 17:33:14 -07:00
Phil Wang	cd26c6b17d	0.22.3	2022-07-12 17:08:31 -07:00
Phil Wang	775abc4df6	add setting to attend to all text encodings regardless of padding, for diffusion prior	2022-07-12 17:08:12 -07:00
Phil Wang	11b1d533a0	make sure text encodings being passed in has the correct batch dimension	2022-07-12 16:00:19 -07:00
Phil Wang	e76e89f9eb	remove text masking altogether in favor of deriving from text encodings (padded text encodings must be pad value of 0.)	2022-07-12 15:40:31 -07:00
Phil Wang	bb3ff0ac67	protect against bad text mask being passed into decoder	2022-07-12 15:33:13 -07:00
Phil Wang	1ec4dbe64f	one more fix for text mask, if the length of the text encoding exceeds max_text_len, add an assert for better error msg	2022-07-12 15:01:46 -07:00
Phil Wang	e0835acca9	generate text mask within the unet and diffusion prior itself from the text encodings, if not given	2022-07-12 12:54:59 -07:00
Phil Wang	e055793e5d	shoutout for @MalumaDev	2022-07-11 16:12:35 -07:00
Phil Wang	1d9ef99288	add PixelShuffleUpsample thanks to @MalumaDev and @marunine for running the experiment and verifying absence of checkerboard artifacts	2022-07-11 16:07:23 -07:00
Phil Wang	bdd62c24b3	zero init final projection in unet, since openai and @crowsonkb are both doing it	2022-07-11 13:22:06 -07:00
Phil Wang	1f1557c614	make it so even if text mask is omitted, it will be derived based on whether text encodings are all 0s or not, simplify dataloading	2022-07-11 10:56:19 -07:00
Aidan Dempster	1a217e99e3	Unet parameter count is now shown (#202)	2022-07-10 16:45:59 -07:00
Phil Wang	7ea314e2f0	allow for final l2norm clamping of the sampled image embed	2022-07-10 09:44:38 -07:00
Phil Wang	4173e88121	more accurate readme	2022-07-09 20:57:26 -07:00
Phil Wang	3dae43fa0e	fix misnamed variable, thanks to @nousr	2022-07-09 19:01:37 -07:00
Phil Wang	a598820012	do not noise for the last step in ddim	2022-07-09 18:38:40 -07:00
Phil Wang	4878762627	fix for small validation bug for sampling steps	2022-07-09 17:31:54 -07:00
Phil Wang	47ae17b36e	more informative error for something that tripped me up	2022-07-09 17:28:14 -07:00
Phil Wang	b7e22f7da0	complete ddim integration of diffusion prior as well as decoder for each unet, feature complete for https://github.com/lucidrains/DALLE2-pytorch/issues/157	2022-07-09 17:25:34 -07:00
Romain Beaumont	68de937aac	Fix decoder test by fixing the resizing output size (#197)	2022-07-09 07:48:07 -07:00
Phil Wang	097afda606	0.18.0	2022-07-08 18:18:38 -07:00
Aidan Dempster	5c520db825	Added deepspeed support (#195)	2022-07-08 18:18:08 -07:00
Phil Wang	3070610231	just force it so researcher can never pass in an image that is less than the size that is required for CLIP or CoCa	2022-07-08 18:17:29 -07:00
Aidan Dempster	870aeeca62	Fixed issue where evaluation would error when large image was loaded (#194)	2022-07-08 17:11:34 -07:00
Romain Beaumont	f28dc6dc01	setup simple ci (#193)	2022-07-08 16:51:56 -07:00
Phil Wang	081d8d3484	0.17.0	2022-07-08 13:36:26 -07:00
Aidan Dempster	a71f693a26	Add the ability to auto restart the last run when started after a crash (#191) * Added autoresume after crash functionality to the trackers * Updated documentation * Clarified what goes in the autorestart object * Fixed style issues Unraveled conditional block Changed to using helper function to get step count	2022-07-08 13:35:40 -07:00
Phil Wang	d7bc5fbedd	expose num_steps_taken helper method on trainer to retrieve number of training steps of each unet	2022-07-08 13:00:56 -07:00
Phil Wang	8c823affff	allow for control over use of nearest interp method of downsampling low res conditioning, in addition to being able to turn it off	2022-07-08 11:44:43 -07:00
Phil Wang	ec7cab01d9	extra insurance that diffusion prior is on the correct device, when using trainer with accelerator or device was given	2022-07-07 10:08:33 -07:00
Phil Wang	46be8c32d3	fix a potential issue in the low resolution conditioner, when downsampling and then upsampling using resize right, thanks to @marunine	2022-07-07 09:41:49 -07:00
Phil Wang	900f086a6d	fix condition_on_text_encodings in dalle2 orchestrator class, fix readme	2022-07-07 07:43:41 -07:00
zion	b3e646fd3b	add readme for prior (#159) * add readme for prior * offload prior info in main readme * typos	2022-07-06 20:50:52 -07:00
Phil Wang	6a59c7093d	more shots in the dark regarding fp16 with learned variance for deepspeed issue	2022-07-06 19:05:50 -07:00
Phil Wang	a6cdbe0b9c	relax learning rate constraint, as @rom1504 wants to try a higher one	2022-07-06 18:09:11 -07:00
Phil Wang	e928ae5c34	default the device to the device that the diffusion prior parameters are on, if the trainer was never given the accelerator nor device	2022-07-06 12:47:48 -07:00
Phil Wang	1bd8a7835a	attempting to fix issue with deepspeed fp16 seeing overflowing gradient	2022-07-06 08:27:34 -07:00
Phil Wang	f33453df9f	debugging with Aidan	2022-07-05 18:22:43 -07:00
Phil Wang	1e4bb2bafb	cast long as float before deriving sinusoidal pos emb	2022-07-05 18:01:22 -07:00
Phil Wang	ee75515c7d	remove forcing of softmax in f32, in case it is interfering with deepspeed	2022-07-05 16:53:58 -07:00
Phil Wang	ec68243479	set ability to do warmup steps for each unet during training	2022-07-05 16:24:16 -07:00
Phil Wang	3afdcdfe86	need to keep track of training steps separately for each unet in decoder trainer	2022-07-05 15:17:59 -07:00
Phil Wang	b9a908ff75	bring in two tricks from the cogview paper for reducing the chances of overflow, for attention and layernorm	2022-07-05 14:27:04 -07:00
Phil Wang	e1fe3089df	do bias-less layernorm manually	2022-07-05 13:09:58 -07:00
Phil Wang	6d477d7654	link to dalle2 laion	2022-07-05 11:43:07 -07:00
Phil Wang	531fe4b62f	status	2022-07-05 10:46:55 -07:00
Phil Wang	ec5a77fc55	0.15.4	2022-07-02 08:56:34 -07:00
Aidan Dempster	fac63c61bc	Fixed variable naming issue (#183)	2022-07-02 08:56:03 -07:00
Phil Wang	3d23ba4aa5	add ability to specify full self attention on specific stages in the unet	2022-07-01 10:22:07 -07:00
Phil Wang	282c35930f	0.15.2	2022-07-01 09:40:11 -07:00
Aidan Dempster	27b0f7ca0d	Overhauled the tracker system (#172) * Overhauled the tracker system Separated the logging and saving capabilities Changed creation to be consistent and initializing behavior to be defined by a class initializer instead of in the training script Added class separation between different types of loaders and savers to make the system more verbose * Changed the saver system to only save the checkpoint once * Added better error handling for saving checkpoints * Fixed an error where wandb would error when passed arbitrary kwargs * Fixed variable naming issues for improved saver Added more logging during long pauses * Fixed which methods need to be dummy to immediately return Added the ability to set whether you find unused parameters * Added more logging for when a wandb loader fails	2022-07-01 09:39:40 -07:00
Phil Wang	7b0edf9e42	allow for returning low resolution conditioning image on forward through decoder with return_lowres_cond_image flag	2022-07-01 09:35:39 -07:00
Phil Wang	a922a539de	bring back convtranspose2d upsampling, allow for nearest upsample with hyperparam, change kernel size of last conv to 1, make configurable, cleanup	2022-07-01 09:21:47 -07:00
Phil Wang	8f2466f1cd	blur sigma for upsampling training was 0.6 in the paper, make that the default value	2022-06-30 17:03:16 -07:00
Phil Wang	908ab83799	add skip connections for all intermediate resnet blocks, also add an extra resnet block for memory efficient version of unet, time condition for both initial resnet block and last one before output	2022-06-29 08:16:58 -07:00
Phil Wang	46a2558d53	bug in pydantic decoder config class	2022-06-29 07:17:35 -07:00
yytdfc	86109646e3	fix a bug of name error (#179)	2022-06-29 07:16:44 -07:00
Phil Wang	6a11b9678b	bring in the skip connection scaling factor, used by imagen in their unets, cite original paper using it	2022-06-26 21:59:55 -07:00
Phil Wang	b90364695d	fix remaining issues with deriving cond_on_text_encodings from child unet settings	2022-06-26 21:07:42 -07:00
zion	868c001199	bug fixes for text conditioning update (#175)	2022-06-26 16:12:32 -07:00
Phil Wang	032e83b0e0	nevermind, do not enforce text encodings on first unet	2022-06-26 12:45:05 -07:00
Phil Wang	2e85e736f3	remove unnecessary decoder setting, and if not unconditional, always make sure the first unet is condition-able on text	2022-06-26 12:32:17 -07:00
Aidan Dempster	f5760bdb92	Add data flexibility to decoder trainer (#165) * Added the ability to train decoder with text embeddings * Added the ability to train using on the fly generated embeddings with clip * Clip now generates embeddings for whatever is not precomputed	2022-06-25 19:05:20 -07:00
zion	c453f468b1	autoswitch tqdm for notebooks (#171) avoids printing the `tqdm` progress bar to a newline in notebooks when detected	2022-06-25 16:37:06 -07:00
zion	98f0c17759	add samples-seen and ema decay (#166)	2022-06-24 15:12:09 -07:00
Phil Wang	a5b9fd6ca8	product management	2022-06-24 08:15:05 -07:00
Phil Wang	4b994601ae	just make sure decoder learning rate is reasonable and help out budding researchers	2022-06-23 11:29:28 -07:00
zion	fddf66e91e	fix params in decoder (#162)	2022-06-22 14:45:01 -07:00