0.2.3

some cleanup
fix a bug with numerical stability in attention, sorry! 🐛
2026-02-12 11:34:29 +01:00 · 2022-05-09 16:50:31 -07:00 · 2022-05-09 16:50:21 -07:00 · 2022-05-09 16:23:37 -07:00 · 2022-05-09 16:05:40 -07:00 · 2022-05-09 13:57:15 -07:00
5 changed files with 242 additions and 95 deletions
--- a/README.md
+++ b/README.md
@@ -927,7 +927,39 @@ The most significant parameters for the script are as follows:

 ### Sample wandb run log

-Please find a sample wandb run log at : https://wandb.ai/laion/diffusion-prior/runs/aul0rhv5?workspace=
+Please find a sample wandb run log at : https://wandb.ai/laion/diffusion-prior/runs/1blxu24j
+
+### Loading and saving the Diffusion Prior model
+
+Two functions are provided, `load_diffusion_model` and `save_diffusion_model`, for restoring a trained Diffusion Prior from a checkpoint and for writing one to disk.
+
+    from dalle2_pytorch.train import load_diffusion_model, save_diffusion_model
+
+    load_diffusion_model(dprior_path, device) 
+
+        dprior_path : path to saved model(.pth)
+    
+        device      : the cuda device you're running on
+    
+    save_diffusion_model(save_path, model, optimizer, scaler, config, image_embed_dim)
+    
+        save_path : path to save at
+    
+        model     : a DiffusionPrior instance
+    
+        optimizer : optimizer object - see train_diffusion_prior.py for how to create one. 
+    
+            e.g: optimizer = get_optimizer(diffusion_prior.net.parameters(), wd=weight_decay, lr=learning_rate)
+    
+        scaler    : a GradScaler object.
+    
+            e.g: scaler = GradScaler(enabled=amp)
+    
+        config    : config object created in train_diffusion_prior.py - see file for example. 
+    
+        image_embed_dim : the dimension of the image embedding
+    
+            e.g: 768

 ## CLI (wip)

@@ -966,6 +998,7 @@ Once built, images will be saved to the same directory the command is invoked
 - [x] add convnext backbone for vqgan-vae (in addition to vit [vit-vqgan] + resnet)
 - [x] make sure DDPMs can be run with traditional resnet blocks (but leave convnext as an option for experimentation)
 - [x] make sure for the latter unets in the cascade, one can train on crops for learning super resolution (constrain the unet to be only convolutions in that case, or allow conv-like attention with rel pos bias)
+- [x] offer setting in diffusion prior to split time and image embeddings into multiple tokens, configurable, for more surface area during attention
 - [ ] become an expert with unets, cleanup unet code, make it fully configurable, port all learnings over to https://github.com/lucidrains/x-unet (test out unet² in ddpm repo) - consider https://github.com/lucidrains/uformer-pytorch attention-based unet
 - [ ] make sure the cascading ddpm in the repository can be trained unconditionally, offer a one-line CLI tool for training on a folder of images
 - [ ] transcribe code to Jax, which lowers the activation energy for distributed training, given access to TPUs
@@ -981,6 +1014,7 @@ Once built, images will be saved to the same directory the command is invoked
 - [ ] make sure FILIP works with DALL-E2 from x-clip https://arxiv.org/abs/2111.07783
 - [ ] make sure resnet hyperparameters can be configurable across unet depth (groups and expansion factor)
 - [ ] offer save / load methods on the trainer classes to automatically take care of state dicts for scalers / optimizers / saving versions and checking for breaking changes
+- [ ] bring in skip-layer excitations (from lightweight gan paper) to see if it helps for either decoder of unet or vqgan-vae training

 ## Citations

--- a/dalle2_pytorch/dalle2_pytorch.py
+++ b/dalle2_pytorch/dalle2_pytorch.py
@@ -4,6 +4,8 @@ from inspect import isfunction
 from functools import partial
 from contextlib import contextmanager
 from collections import namedtuple
+from pathlib import Path
+import time

 import torch
 import torch.nn.functional as F
@@ -639,7 +641,7 @@ class Attention(nn.Module):

        # attention

-        sim = sim - sim.amax(dim = -1, keepdim = True)
+        sim = sim - sim.amax(dim = -1, keepdim = True).detach()
        attn = sim.softmax(dim = -1)
        attn = self.dropout(attn)
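
Review note on the hunk above: `sim` has its row maximum subtracted before the softmax, and the maximum is now detached. Below is a minimal sketch of why the subtraction matters and why it leaves the output unchanged. It uses only the standard library, not the repository's PyTorch code, and the helper names `naive_softmax` and `stable_softmax` are made up for illustration.

```python
import math

def naive_softmax(row):
    # Exponentiates the raw logits; math.exp overflows for large inputs.
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(row):
    # Subtract the row maximum first so the largest exponent is exp(0) = 1.
    # Softmax is invariant to adding a constant to every logit, so the
    # output is mathematically unchanged.
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # the same logits shifted by 999

# On small logits both versions agree.
agree = all(
    math.isclose(a, b) for a, b in zip(naive_softmax(small), stable_softmax(small))
)

# On large logits the naive version overflows, the stable one does not.
try:
    naive_softmax(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# Shifting every logit by the same amount does not change the result.
shift_invariant = all(
    math.isclose(a, b) for a, b in zip(stable_softmax(small), stable_softmax(large))
)
```

Because the output does not depend on the shift, the gradient contribution through the maximum is zero in exact arithmetic, so detaching it should not change the mathematical result; it only keeps autograd from backpropagating through the `amax` term.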

@@ -703,10 +705,31 @@ class DiffusionPriorNetwork(nn.Module):
        self,
        dim,
        num_timesteps = None,
+        num_time_embeds = 1,
+        num_image_embeds = 1,
+        num_text_embeds = 1,
        **kwargs
    ):
        super().__init__()
-        self.time_embeddings = nn.Embedding(num_timesteps, dim) if exists(num_timesteps) else nn.Sequential(Rearrange('b -> b 1'), MLP(1, dim)) # also offer a continuous version of timestep embeddings, with a 2 layer MLP
+        self.num_time_embeds = num_time_embeds
+        self.num_image_embeds = num_image_embeds
+        self.num_text_embeds = num_text_embeds
+
+        self.to_text_embeds = nn.Sequential(
+            nn.Linear(dim, dim * num_text_embeds) if num_text_embeds > 1 else nn.Identity(),
+            Rearrange('b (n d) -> b n d', n = num_text_embeds)
+        )
+
+        self.to_time_embeds = nn.Sequential(
+            nn.Embedding(num_timesteps, dim * num_time_embeds) if exists(num_timesteps) else nn.Sequential(SinusoidalPosEmb(dim), MLP(dim, dim * num_time_embeds)), # also offer a continuous version of timestep embeddings, with a 2 layer MLP
+            Rearrange('b (n d) -> b n d', n = num_time_embeds)
+        )
+
+        self.to_image_embeds = nn.Sequential(
+            nn.Linear(dim, dim * num_image_embeds) if num_image_embeds > 1 else nn.Identity(),
+            Rearrange('b (n d) -> b n d', n = num_image_embeds)
+        )
+
        self.learned_query = nn.Parameter(torch.randn(dim))
        self.causal_transformer = CausalTransformer(dim = dim, **kwargs)

@@ -736,10 +759,13 @@ class DiffusionPriorNetwork(nn.Module):
    ):
        batch, dim, device, dtype = *image_embed.shape, image_embed.device, image_embed.dtype

+        num_time_embeds, num_image_embeds, num_text_embeds = self.num_time_embeds, self.num_image_embeds, self.num_text_embeds
+
        # in section 2.2, last paragraph
        # "... consisting of encoded text, CLIP text embedding, diffusion timestep embedding, noised CLIP image embedding, final embedding for prediction"

-        text_embed, image_embed = rearrange_many((text_embed, image_embed), 'b d -> b 1 d')
+        text_embed = self.to_text_embeds(text_embed)
+        image_embed = self.to_image_embeds(image_embed)

        # make text encodings optional
        # although the paper seems to suggest it is present <--
@@ -759,16 +785,17 @@ class DiffusionPriorNetwork(nn.Module):

        # whether text embedding is masked or not depends on the classifier free guidance conditional masking

+        keep_mask = repeat(keep_mask, 'b 1 -> b n', n = num_text_embeds)
        mask = torch.cat((mask, keep_mask), dim = 1)

        # whether text embedding is used for conditioning depends on whether text encodings are available for attention (for classifier free guidance, even though it seems from the paper it was not used in the prior ddpm, as the objective is different)
        # but let's just do it right

        if exists(mask):
-            mask = F.pad(mask, (0, 3), value = True) # extend mask for text embedding, noised image embedding, time step embedding, and learned query
+            attend_padding = 1 + num_time_embeds + num_image_embeds # 1 for learned queries + number of image embeds + time embeds
+            mask = F.pad(mask, (0, attend_padding), value = True) # extend mask for time step embeddings, noised image embeddings, and learned query (text embedding is already covered by keep_mask above)

-        time_embed = self.time_embeddings(diffusion_timesteps)
-        time_embed = rearrange(time_embed, 'b d -> b 1 d')
+        time_embed = self.to_time_embeds(diffusion_timesteps)

        learned_queries = repeat(self.learned_query, 'd -> b 1 d', b = batch)
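
Review note on the hunk above: the mask now has to cover a variable number of tokens. The sketch below works through the arithmetic for a single sample using plain Python lists instead of batched tensors. The function name `build_attention_mask` and its arguments are hypothetical and only mirror the order of operations in the diff: text-encoding mask, then the repeated keep mask, then `True` padding.

```python
def build_attention_mask(text_encoding_mask, keep_text_embed,
                         num_text_embeds=1, num_time_embeds=1, num_image_embeds=1):
    # One mask entry per text-embed token, all sharing the same keep decision
    # (this mirrors repeat(keep_mask, 'b 1 -> b n', n = num_text_embeds)).
    text_embed_mask = [keep_text_embed] * num_text_embeds

    # Time embeds, noised image embeds and the single learned query are
    # always attended to, so they are padded with True.
    attend_padding = 1 + num_time_embeds + num_image_embeds

    return list(text_encoding_mask) + text_embed_mask + [True] * attend_padding

# Three text-encoding tokens (the last one is padding), text embed dropped
# for classifier free guidance, with 2 text, 2 time and 4 image tokens.
mask = build_attention_mask(
    [True, True, False],
    keep_text_embed=False,
    num_text_embeds=2,
    num_time_embeds=2,
    num_image_embeds=4,
)

# With the defaults the padding is 3, which matches the old hard-coded value.
default_mask = build_attention_mask([True], keep_text_embed=True)
```

With the default of one token each, `attend_padding` evaluates to 3, so the new code reduces to the previous `F.pad(mask, (0, 3), value = True)`.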

@@ -800,13 +827,14 @@ class DiffusionPrior(BaseGaussianDiffusion):
        image_size = None,
        image_channels = 3,
        timesteps = 1000,
-        cond_drop_prob = 0.2,
+        cond_drop_prob = 0.,
        loss_type = "l1",
        predict_x_start = True,
        beta_schedule = "cosine",
        condition_on_text_encodings = True, # the paper suggests this is needed, but you can turn it off for your CLIP preprocessed text embed -> image embed training
        sampling_clamp_l2norm = False,
        training_clamp_l2norm = False,
+        init_image_embed_l2norm = False,
        image_embed_scale = None,           # this is for scaling the l2-normed image embedding, so it is more suitable for gaussian diffusion, as outlined by Katherine (@crowsonkb) https://github.com/lucidrains/DALLE2-pytorch/issues/60#issue-1226116132
        clip_adapter_overrides = dict()
    ):
@@ -845,6 +873,7 @@ class DiffusionPrior(BaseGaussianDiffusion):
        # whether to force an l2norm, similar to clipping denoised, when sampling
        self.sampling_clamp_l2norm = sampling_clamp_l2norm
        self.training_clamp_l2norm = training_clamp_l2norm
+        self.init_image_embed_l2norm = init_image_embed_l2norm

    def p_mean_variance(self, x, t, text_cond, clip_denoised: bool):
        pred = self.net(x, t, **text_cond)
@@ -879,11 +908,16 @@ class DiffusionPrior(BaseGaussianDiffusion):
        device = self.betas.device

        b = shape[0]
-        img = torch.randn(shape, device=device)
+        image_embed = torch.randn(shape, device=device)
+
+        if self.init_image_embed_l2norm:
+            image_embed = l2norm(image_embed) * self.image_embed_scale

        for i in tqdm(reversed(range(0, self.num_timesteps)), desc='sampling loop time step', total=self.num_timesteps):
-            img = self.p_sample(img, torch.full((b,), i, device = device, dtype = torch.long), text_cond = text_cond)
-        return img
+            times = torch.full((b,), i, device = device, dtype = torch.long)
+            image_embed = self.p_sample(image_embed, times, text_cond = text_cond)
+
+        return image_embed

    def p_losses(self, image_embed, times, text_cond, noise = None):
        noise = default(noise, lambda: torch.randn_like(image_embed))
@@ -1134,7 +1168,7 @@ class CrossAttention(nn.Module):
            mask = rearrange(mask, 'b j -> b 1 1 j')
            sim = sim.masked_fill(~mask, max_neg_value)

-        sim = sim - sim.amax(dim = -1, keepdim = True)
+        sim = sim - sim.amax(dim = -1, keepdim = True).detach()
        attn = sim.softmax(dim = -1)

        out = einsum('b h i j, b h j d -> b h i d', attn, v)
@@ -1890,3 +1924,4 @@ class DALLE2(nn.Module):
            return images[0]

        return images
+
--- a/dalle2_pytorch/train.py
+++ b/dalle2_pytorch/train.py
@@ -39,6 +39,50 @@ def groupby_prefix_and_trim(prefix, d):
    kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
    return kwargs_without_prefix, kwargs

+# print helpers
+
+def print_ribbon(s, symbol = '=', repeat = 40):
+    flank = symbol * repeat
+    return f'{flank} {s} {flank}'
+
+# saving and loading functions
+
+# for diffusion prior
+
+def load_diffusion_model(dprior_path, device):
+    dprior_path = Path(dprior_path)
+    assert dprior_path.exists(), 'Dprior model file does not exist'
+    loaded_obj = torch.load(str(dprior_path), map_location='cpu')
+
+    # Get hyperparameters of loaded model
+    dpn_config = loaded_obj['hparams']['diffusion_prior_network']
+    dp_config = loaded_obj['hparams']['diffusion_prior']
+    image_embed_dim = loaded_obj['image_embed_dim']['image_embed_dim']
+
+    # Create DiffusionPriorNetwork and DiffusionPrior with loaded hyperparameters
+
+    # DiffusionPriorNetwork
+    prior_network = DiffusionPriorNetwork( dim = image_embed_dim, **dpn_config).to(device)
+
+    # DiffusionPrior with text embeddings and image embeddings pre-computed
+    diffusion_prior = DiffusionPrior(net = prior_network, **dp_config, image_embed_dim = image_embed_dim).to(device)
+
+    # Load state dict from saved model
+    diffusion_prior.load_state_dict(loaded_obj['model'])
+
+    return diffusion_prior
+
+def save_diffusion_model(save_path, model, optimizer, scaler, config, image_embed_dim):
+    # Saving State Dict
+    print(print_ribbon('Saving checkpoint'))
+
+    state_dict = dict(model=model.state_dict(),
+                      optimizer=optimizer.state_dict(),
+                      scaler=scaler.state_dict(),
+                      hparams = config,
+                      image_embed_dim = {"image_embed_dim":image_embed_dim})
+    torch.save(state_dict, save_path+'/'+str(time.time())+'_saved_model.pth')
+
 # exponential moving average wrapper

 class EMA(nn.Module):
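
Review note on `load_diffusion_model` / `save_diffusion_model`: the checkpoint is a plain dictionary, and `image_embed_dim` is stored inside a one-key dict, which is why the loader does a nested lookup. The sketch below shows that layout as a round trip. It is not the repository's code: `pickle` stands in for `torch.save` / `torch.load`, the optimizer and scaler entries are omitted, the function names `save_checkpoint` and `load_checkpoint` are hypothetical, and the config values are placeholders.

```python
import pickle
import tempfile
import time
from pathlib import Path

def save_checkpoint(save_dir, model_state, config, image_embed_dim):
    # Same top-level keys as save_diffusion_model, minus optimizer and scaler.
    checkpoint = dict(
        model=model_state,
        hparams=config,
        image_embed_dim={"image_embed_dim": image_embed_dim},
    )
    # Timestamped file name, following the pattern used in the diff.
    path = Path(save_dir) / f"{time.time()}_saved_model.pth"
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)  # stand-in for torch.save
    return path

def load_checkpoint(path):
    path = Path(path)
    assert path.exists(), "checkpoint file does not exist"
    with open(path, "rb") as f:
        loaded = pickle.load(f)  # stand-in for torch.load
    dpn_config = loaded["hparams"]["diffusion_prior_network"]
    dp_config = loaded["hparams"]["diffusion_prior"]
    # Note the nested lookup: the dimension lives inside a one-key dict.
    image_embed_dim = loaded["image_embed_dim"]["image_embed_dim"]
    return loaded["model"], dpn_config, dp_config, image_embed_dim

# Placeholder hyperparameters, shaped like the config built in main().
config = {
    "diffusion_prior_network": {"depth": 6, "dim_head": 64, "heads": 8, "normformer": False},
    "diffusion_prior": {"timesteps": 100, "cond_drop_prob": 0.1, "loss_type": "l2"},
}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_checkpoint(tmp, {"weight": [0.1, 0.2]}, config, 768)
    model_state, dpn_config, dp_config, image_embed_dim = load_checkpoint(saved_path)
```

Storing the two hyperparameter groups under separate keys is what lets the loader rebuild `DiffusionPriorNetwork` and `DiffusionPrior` with `**dpn_config` and `**dp_config` before loading the state dict.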
--- a/setup.py
+++ b/setup.py
@@ -10,7 +10,7 @@ setup(
      'dream = dalle2_pytorch.cli:dream'
    ],
  },
-  version = '0.1.6',
+  version = '0.2.3',
  license='MIT',
  description = 'DALL-E 2',
  author = 'Phil Wang',
--- a/train_diffusion_prior.py
+++ b/train_diffusion_prior.py
@@ -7,6 +7,7 @@ import torch
 from torch import nn
 from embedding_reader import EmbeddingReader
 from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork
+from dalle2_pytorch.train import load_diffusion_model, save_diffusion_model, print_ribbon
 from dalle2_pytorch.optimizer import get_optimizer
 from torch.cuda.amp import autocast,GradScaler

@@ -41,69 +42,56 @@ def eval_model(model,device,image_reader,text_reader,start,end,batch_size,loss_t
        avg_loss = (total_loss / total_samples)
        wandb.log({f'{phase} {loss_type}': avg_loss})

-def save_model(save_path, state_dict):
-    # Saving State Dict
-    print("====================================== Saving checkpoint ======================================")
-    torch.save(state_dict, save_path+'/'+str(time.time())+'_saved_model.pth')
+def report_cosine_sims(diffusion_prior,image_reader,text_reader,train_set_size,NUM_TEST_EMBEDDINGS,device):
+    diffusion_prior.eval()

-
-def report_cosine_sims(diffusion_prior, image_reader, text_reader, train_set_size, val_set_size, NUM_TEST_EMBEDDINGS, device):
    cos = nn.CosineSimilarity(dim=1, eps=1e-6)

-    tstart = train_set_size+val_set_size
-    tend = train_set_size+val_set_size+NUM_TEST_EMBEDDINGS
-
-    for embt, embi in zip(text_reader(batch_size=NUM_TEST_EMBEDDINGS, start=tstart, end=tend), image_reader(batch_size=NUM_TEST_EMBEDDINGS, start=tstart, end=tend)):
-        # make a copy of the text embeddings for shuffling
-        text_embed = torch.tensor(embt[0]).to(device)
-        text_embed_shuffled = text_embed.clone()
+    tstart = train_set_size
+    tend = train_set_size+NUM_TEST_EMBEDDINGS

+    for embt, embi in zip(text_reader(batch_size=NUM_TEST_EMBEDDINGS, start=tstart, end=tend), 
+            image_reader(batch_size=NUM_TEST_EMBEDDINGS, start=tstart, end=tend)):
+       # make a copy of the text embeddings for shuffling
+       text_embed = torch.tensor(embt[0]).to(device)
+       text_embed_shuffled = text_embed.clone()
        # roll the text embeddings to simulate "unrelated" captions
-        rolled_idx = torch.roll(torch.arange(NUM_TEST_EMBEDDINGS), 1)
-        text_embed_shuffled = text_embed_shuffled[rolled_idx]
-        text_embed_shuffled = text_embed_shuffled / \
-            text_embed_shuffled.norm(dim=1, keepdim=True)
-        test_text_shuffled_cond = dict(text_embed=text_embed_shuffled)
-
+       rolled_idx = torch.roll(torch.arange(NUM_TEST_EMBEDDINGS), 1)
+       text_embed_shuffled = text_embed_shuffled[rolled_idx]
+       text_embed_shuffled = text_embed_shuffled / \
+           text_embed_shuffled.norm(dim=1, keepdim=True)
+       test_text_shuffled_cond = dict(text_embed=text_embed_shuffled)
        # prepare the text embedding
-        text_embed = text_embed / text_embed.norm(dim=1, keepdim=True)
-        test_text_cond = dict(text_embed=text_embed)
-
+       text_embed = text_embed / text_embed.norm(dim=1, keepdim=True)
+       test_text_cond = dict(text_embed=text_embed)
        # prepare image embeddings
-        test_image_embeddings = torch.tensor(embi[0]).to(device)
-        test_image_embeddings = test_image_embeddings / \
-            test_image_embeddings.norm(dim=1, keepdim=True)
-
+       test_image_embeddings = torch.tensor(embi[0]).to(device)
+       test_image_embeddings = test_image_embeddings / \
+           test_image_embeddings.norm(dim=1, keepdim=True)
        # predict on the unshuffled text embeddings
-        predicted_image_embeddings = diffusion_prior.p_sample_loop(
-            (NUM_TEST_EMBEDDINGS, 768), text_cond=test_text_cond)
-        predicted_image_embeddings = predicted_image_embeddings / \
-            predicted_image_embeddings.norm(dim=1, keepdim=True)
-
+       predicted_image_embeddings = diffusion_prior.p_sample_loop(
+           (NUM_TEST_EMBEDDINGS, 768), text_cond=test_text_cond)
+       predicted_image_embeddings = predicted_image_embeddings / \
+           predicted_image_embeddings.norm(dim=1, keepdim=True)
        # predict on the shuffled embeddings
-        predicted_unrelated_embeddings = diffusion_prior.p_sample_loop(
-            (NUM_TEST_EMBEDDINGS, 768), text_cond=test_text_shuffled_cond)
-        predicted_unrelated_embeddings = predicted_unrelated_embeddings / \
-            predicted_unrelated_embeddings.norm(dim=1, keepdim=True)
-
+       predicted_unrelated_embeddings = diffusion_prior.p_sample_loop(
+           (NUM_TEST_EMBEDDINGS, 768), text_cond=test_text_shuffled_cond)
+       predicted_unrelated_embeddings = predicted_unrelated_embeddings / \
+           predicted_unrelated_embeddings.norm(dim=1, keepdim=True)
        # calculate similarities
-        original_similarity = cos(
-            text_embed, test_image_embeddings).cpu().numpy()
-        predicted_similarity = cos(
-            text_embed, predicted_image_embeddings).cpu().numpy()
-        unrelated_similarity = cos(
-            text_embed, predicted_unrelated_embeddings).cpu().numpy()
-
-        wandb.log(
-            {"CosineSimilarity(text_embed,image_embed)": np.mean(original_similarity)})
-        wandb.log({"CosineSimilarity(text_embed,predicted_image_embed)": np.mean(
-            predicted_similarity)})
-        wandb.log({"CosineSimilarity(text_embed,predicted_unrelated_embed)": np.mean(
-            unrelated_similarity)})
-
-    return np.mean(predicted_similarity - original_similarity)
-
-
+       original_similarity = cos(
+           text_embed, test_image_embeddings).cpu().numpy()
+       predicted_similarity = cos(
+           text_embed, predicted_image_embeddings).cpu().numpy()
+       unrelated_similarity = cos(
+           text_embed, predicted_unrelated_embeddings).cpu().numpy()
+       predicted_img_similarity = cos(
+           test_image_embeddings, predicted_image_embeddings).cpu().numpy()
+       wandb.log({"CosineSimilarity(text_embed,image_embed)": np.mean(original_similarity),
+            "CosineSimilarity(text_embed,predicted_image_embed)":np.mean(predicted_similarity),
+            "CosineSimilarity(orig_image_embed,predicted_image_embed)":np.mean(predicted_img_similarity),
+            "CosineSimilarity(text_embed,predicted_unrelated_embed)": np.mean(unrelated_similarity),
+            "Cosine similarity difference":np.mean(predicted_similarity - original_similarity)})

 def train(image_embed_dim,
          image_embed_url,
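
Review note on `report_cosine_sims`: the "unrelated" captions are produced by rolling the batch indices by one position, so each sample is paired with its neighbour's text embedding. A small pure-Python sketch of that index trick follows. The `roll` helper is a hypothetical stand-in for `torch.roll` on a 1-D tensor, and the captions are invented.

```python
def roll(items, shift=1):
    # Circular shift to the right, like torch.roll on a 1-D tensor.
    n = len(items)
    shift %= n
    return items[-shift:] + items[:-shift] if shift else list(items)

captions = ["a dog", "a red car", "a mountain lake", "a bowl of soup"]

# Equivalent of torch.roll(torch.arange(NUM_TEST_EMBEDDINGS), 1)
rolled_idx = roll(list(range(len(captions))), 1)

# Index with the rolled positions so every sample gets another sample's caption.
shuffled = [captions[i] for i in rolled_idx]
```

A shift of one guarantees that no sample keeps its own caption whenever the batch has more than one element, which a random permutation would not.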
@@ -125,9 +113,15 @@ def train(image_embed_dim,
          save_interval,
          save_path,
          device,
+          RESUME,
+          DPRIOR_PATH,
+          config,
+          wandb_entity,
+          wandb_project,
          learning_rate=0.001,
          max_grad_norm=0.5,
          weight_decay=0.01,
+          dropout=0.05,
          amp=False):

    # DiffusionPriorNetwork 
@@ -136,6 +130,8 @@ def train(image_embed_dim,
            depth = dpn_depth, 
            dim_head = dpn_dim_head, 
            heads = dpn_heads,
+            attn_dropout = dropout,
+            ff_dropout = dropout,
            normformer = dp_normformer).to(device)
    
    # DiffusionPrior with text embeddings and image embeddings pre-computed
@@ -148,16 +144,21 @@ def train(image_embed_dim,
            loss_type = dp_loss_type, 
            condition_on_text_encodings = dp_condition_on_text_encodings).to(device)

-    # Get image and text embeddings from the servers
-    print("==============Downloading embeddings - image and text====================")
-    image_reader = EmbeddingReader(embeddings_folder=image_embed_url, file_format="npy")
-    text_reader  = EmbeddingReader(embeddings_folder=text_embed_url, file_format="npy")
-    num_data_points = text_reader.count
+    # Load pre-trained model from DPRIOR_PATH
+    if RESUME:
+        diffusion_prior=load_diffusion_model(DPRIOR_PATH,device)   
+        wandb.init( entity=wandb_entity, project=wandb_project, config=config) 

    # Create save_path if it doesn't exist
    if not os.path.exists(save_path):
        os.makedirs(save_path)

+    # Get image and text embeddings from the servers
+    print(print_ribbon("Downloading embeddings - image and text"))
+    image_reader = EmbeddingReader(embeddings_folder=image_embed_url, file_format="npy")
+    text_reader  = EmbeddingReader(embeddings_folder=text_embed_url, file_format="npy")
+    num_data_points = text_reader.count
+
    ### Training code ###
    scaler = GradScaler(enabled=amp)
    optimizer = get_optimizer(diffusion_prior.net.parameters(), wd=weight_decay, lr=learning_rate)
@@ -168,12 +169,15 @@ def train(image_embed_dim,

    train_set_size = int(train_percent*num_data_points)
    val_set_size = int(val_percent*num_data_points)
+    eval_start = train_set_size

    for _ in range(epochs):
-        diffusion_prior.train()

        for emb_images,emb_text in zip(image_reader(batch_size=batch_size, start=0, end=train_set_size),
                text_reader(batch_size=batch_size, start=0, end=train_set_size)):
+
+            diffusion_prior.train()
+            
            emb_images_tensor = torch.tensor(emb_images[0]).to(device)
            emb_text_tensor = torch.tensor(emb_text[0]).to(device)

@@ -188,9 +192,13 @@ def train(image_embed_dim,
            if(int(time.time()-t) >= 60*save_interval):
                t = time.time()

-                save_model(
+                save_diffusion_model(
                    save_path,
-                    dict(model=diffusion_prior.state_dict(), optimizer=optimizer.state_dict(), scaler=scaler.state_dict()))
+                    diffusion_prior,
+                    optimizer,
+                    scaler,
+                    config,
+                    image_embed_dim)

            # Log to wandb
            wandb.log({"Training loss": loss.item(),
@@ -200,14 +208,22 @@ def train(image_embed_dim,
            # Use NUM_TEST_EMBEDDINGS samples from the test set each time
            # Get embeddings from the most recently saved model
            if(step % REPORT_METRICS_EVERY) == 0:
-                diff_cosine_sim = report_cosine_sims(diffusion_prior,
+                report_cosine_sims(diffusion_prior,
                        image_reader,
                        text_reader,
                        train_set_size,
-                        val_set_size,
                        NUM_TEST_EMBEDDINGS,
                        device)
-                wandb.log({"Cosine similarity difference": diff_cosine_sim})
+                ### Evaluate model(validation run) ###
+                eval_model(diffusion_prior,
+                        device,
+                        image_reader,
+                        text_reader,
+                        eval_start,
+                        eval_start+NUM_TEST_EMBEDDINGS,
+                        NUM_TEST_EMBEDDINGS,
+                        dp_loss_type,
+                        phase="Validation")

            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(diffusion_prior.parameters(), max_grad_norm)
@@ -216,11 +232,6 @@ def train(image_embed_dim,
            scaler.update()
            optimizer.zero_grad()

-        ### Evaluate model(validation run) ###
-        start = train_set_size
-        end=start+val_set_size
-        eval_model(diffusion_prior,device,image_reader,text_reader,start,end,batch_size,dp_loss_type,phase="Validation")
-
    ### Test run ###
    test_set_size = int(test_percent*train_set_size) 
    start=train_set_size+val_set_size
@@ -232,7 +243,6 @@ def main():
    # Logging
    parser.add_argument("--wandb-entity", type=str, default="laion")
    parser.add_argument("--wandb-project", type=str, default="diffusion-prior")
-    parser.add_argument("--wandb-name", type=str, default="laion-dprior")
    parser.add_argument("--wandb-dataset", type=str, default="LAION-5B")
    parser.add_argument("--wandb-arch", type=str, default="DiffusionPrior")
    # URLs for embeddings 
@@ -241,6 +251,7 @@ def main():
    # Hyperparameters
    parser.add_argument("--learning-rate", type=float, default=1.1e-4)
    parser.add_argument("--weight-decay", type=float, default=6.02e-2)
+    parser.add_argument("--dropout", type=float, default=5e-2)
    parser.add_argument("--max-grad-norm", type=float, default=0.5)
    parser.add_argument("--batch-size", type=int, default=10**4)
    parser.add_argument("--num-epochs", type=int, default=5)
@@ -258,7 +269,6 @@ def main():
    # DiffusionPrior(dp) parameters
    parser.add_argument("--dp-condition-on-text-encodings", type=bool, default=False)
    parser.add_argument("--dp-timesteps", type=int, default=100)
-    parser.add_argument("--dp-l2norm-output", type=bool, default=False)
    parser.add_argument("--dp-normformer", type=bool, default=False)
    parser.add_argument("--dp-cond-drop-prob", type=float, default=0.1)
    parser.add_argument("--dp-loss-type", type=str, default="l2")
@@ -267,22 +277,40 @@ def main():
    # Model checkpointing interval(minutes)
    parser.add_argument("--save-interval", type=int, default=30)
    parser.add_argument("--save-path", type=str, default="./diffusion_prior_checkpoints")
+    # Saved model path 
+    parser.add_argument("--pretrained-model-path", type=str, default=None)

    args = parser.parse_args()

-    print("Setting up wandb logging... Please wait...")
+    config = ({"learning_rate": args.learning_rate,
+        "architecture": args.wandb_arch,
+        "dataset": args.wandb_dataset,
+        "weight_decay":args.weight_decay,
+        "max_gradient_clipping_norm":args.max_grad_norm,
+        "batch_size":args.batch_size,
+        "epochs": args.num_epochs,
+        "diffusion_prior_network":{"depth":args.dpn_depth,
+        "dim_head":args.dpn_dim_head,
+        "heads":args.dpn_heads,
+        "normformer":args.dp_normformer},
+        "diffusion_prior":{"condition_on_text_encodings": args.dp_condition_on_text_encodings,
+        "timesteps": args.dp_timesteps,
+        "cond_drop_prob":args.dp_cond_drop_prob,
+        "loss_type":args.dp_loss_type,
+        "clip":args.clip}
+        })

-    wandb.init(
-      entity=args.wandb_entity,
-      project=args.wandb_project,
-      config={
-      "learning_rate": args.learning_rate,
-      "architecture": args.wandb_arch,
-      "dataset": args.wandb_dataset,
-      "epochs": args.num_epochs,
-      })
+    RESUME = False
+    # Check if DPRIOR_PATH exists(saved model path)
+    DPRIOR_PATH = args.pretrained_model_path
+    if(DPRIOR_PATH is not None):
+        RESUME = True
+    else:
+        wandb.init(
+          entity=args.wandb_entity,
+          project=args.wandb_project,
+          config=config)

-    print("wandb logging setup done!")
    # Obtain the utilized device.

    has_cuda = torch.cuda.is_available()
@@ -311,9 +339,15 @@ def main():
          args.save_interval,
          args.save_path,
          device,
+          RESUME,
+          DPRIOR_PATH,
+          config,
+          args.wandb_entity,
+          args.wandb_project,
          args.learning_rate,
          args.max_grad_norm,
          args.weight_decay,
+          args.dropout,
          args.amp)

 if __name__ == "__main__":
Author	SHA1	Message	Date
Phil Wang	ba64ea45cc	0.2.3	2022-05-09 16:50:31 -07:00
Phil Wang	64f7be1926	some cleanup	2022-05-09 16:50:21 -07:00
Phil Wang	db805e73e1	fix a bug with numerical stability in attention, sorry! 🐛	2022-05-09 16:23:37 -07:00
z	cb07b37970	Ensure Eval Mode In Metric Functions (#79 ) * add eval/train toggles * train/eval flags * shift train toggle Co-authored-by: nousr <z@localhost.com>	2022-05-09 16:05:40 -07:00
Phil Wang	a774bfefe2	add attention and feedforward dropouts to train_diffusion_prior script	2022-05-09 13:57:15 -07:00
Phil Wang	2ae57f0cf5	cleanup	2022-05-09 13:51:26 -07:00
Phil Wang	e46eaec817	deal the diffusion prior problem yet another blow	2022-05-09 11:08:52 -07:00
Kumar R	8647cb5e76	Val loss changes, with quite a few other changes. This is in place of the earlier PR(https://github.com/lucidrains/DALLE2-pytorch/pull/67 ) (#77 ) * Val_loss changes - no rebased with lucidrains' master. * Val Loss changes - now rebased with lucidrains' master * train_diffusion_prior.py updates * dalle2_pytorch.py updates * __init__.py changes * Update train_diffusion_prior.py * Update dalle2_pytorch.py * Update train_diffusion_prior.py * Update train_diffusion_prior.py * Update dalle2_pytorch.py * Update train_diffusion_prior.py * Update train_diffusion_prior.py * Update train_diffusion_prior.py * Update train_diffusion_prior.py * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md	2022-05-09 08:53:29 -07:00
Phil Wang	53c189e46a	give more surface area for attention in diffusion prior	2022-05-09 08:08:11 -07:00
Phil Wang	dde51fd362	revert restriction for classifier free guidance for diffusion prior, given @crowsonkb advice	2022-05-07 20:55:41 -07:00
Nasir Khalid	2eac7996fa	Additional image_embed metric (#75 ) Added metric to track image_embed vs predicted_image_embed	2022-05-07 14:32:33 -07:00
Phil Wang	4010aec033	turn off classifier free guidance if predicting x_start for diffusion prior	2022-05-07 09:38:17 -07:00
Phil Wang	c87b84a259	todo	2022-05-07 09:21:08 -07:00
Phil Wang	8b05468653	todo	2022-05-07 08:33:45 -07:00
Phil Wang	830afd3c15	sinusoidal embed time embeddings for diffusion prior as well, for continuous version	2022-05-07 08:32:43 -07:00
Phil Wang	8f93729d19	when in doubt, make it a hyperparameter	2022-05-07 07:52:17 -07:00