Commit Graph

253 Commits

Author SHA1 Message Date
Phil Wang
b8cf1e5c20 more attention 2022-05-01 11:00:33 -07:00
Phil Wang
5e421bd5bb let researchers do the hyperparameter search 2022-05-01 08:46:21 -07:00
Phil Wang
67fcab1122 add MLP based time conditioning to all convnexts, in addition to cross attention. also add an initial convolution, given convnext's first conv is depthwise 2022-05-01 08:41:02 -07:00
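A minimal sketch of the MLP based time conditioning idea in the commit above: the diffusion timestep is embedded sinusoidally, pushed through a small MLP, and the result is injected into each convnext block. Names and dimensions are illustrative, not the repo's exact API.

```python
import math
import torch
from torch import nn

class TimeConditioning(nn.Module):
    # illustrative module: sinusoidal timestep embedding followed by a small MLP
    def __init__(self, dim, dim_out):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim_out),
            nn.GELU(),
            nn.Linear(dim_out, dim_out)
        )

    def forward(self, times):
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000) * torch.arange(half, device = times.device) / half)
        emb = times[:, None].float() * freqs[None, :]
        emb = torch.cat((emb.sin(), emb.cos()), dim = -1)  # (batch, dim) sinusoidal embedding
        return self.mlp(emb)                               # (batch, dim_out), added into each block

time_cond = TimeConditioning(dim = 16, dim_out = 128)
t = torch.randint(0, 1000, (4,))
print(time_cond(t).shape)  # torch.Size([4, 128])
```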
Phil Wang
d1a697ac23 allows one to shortcut sampling at a specific unet number, if one were to be training in stages 2022-04-30 16:05:13 -07:00
Phil Wang
a9421f49ec simplify Decoder training for the public 2022-04-30 11:45:18 -07:00
Phil Wang
77fa34eae9 fix all clipping / clamping issues 2022-04-30 10:08:24 -07:00
Phil Wang
1c1e508369 fix all issues with text encodings conditioning in the decoder, using null padding tokens technique from dalle v1 2022-04-30 09:13:34 -07:00
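A minimal sketch of the null padding tokens technique referenced above: positions whose token id is 0 are treated as padding and masked out of the decoder's cross attention. Helper name and token ids are illustrative.

```python
import torch

def text_mask_from_ids(token_ids, pad_id = 0):
    # boolean mask that is False on null padding positions, so cross attention ignores them
    return token_ids != pad_id

token_ids = torch.tensor([[49406, 320, 2368, 49407, 0, 0]])  # toy ids, trailing zeros are padding
print(text_mask_from_ids(token_ids))  # tensor([[ True,  True,  True,  True, False, False]])
```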
Phil Wang
f19c99ecb0 fix decoder needing separate conditional dropping probabilities for image embeddings and text encodings, thanks to @xiankgx ! 2022-04-30 08:48:05 -07:00
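A minimal sketch of what separate conditional dropping probabilities mean for classifier-free guidance training: image embeddings and text encodings are each zeroed out independently with their own probability. Function and argument names are illustrative, not the repo's exact API.

```python
import torch

def prob_mask_like(shape, prob, device = None):
    # True with probability `prob`, per batch element
    return torch.zeros(shape, device = device).float().uniform_(0, 1) < prob

def drop_conditioning(image_embed, text_encodings, image_drop_prob = 0.1, text_drop_prob = 0.5):
    batch, device = image_embed.shape[0], image_embed.device
    drop_image = prob_mask_like((batch, 1), image_drop_prob, device)
    drop_text  = prob_mask_like((batch, 1, 1), text_drop_prob, device)
    image_embed    = image_embed.masked_fill(drop_image, 0.)    # (batch, dim)
    text_encodings = text_encodings.masked_fill(drop_text, 0.)  # (batch, seq, dim)
    return image_embed, text_encodings

image_embed, text_enc = drop_conditioning(torch.randn(8, 512), torch.randn(8, 77, 512))
```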
Phil Wang
20e7eb5a9b cleanup 2022-04-30 07:22:57 -07:00
Phil Wang
e2f9615afa use @clip-anytorch , thanks to @rom1504 2022-04-30 06:40:54 -07:00
Phil Wang
0d1c07c803 fix a bug with classifier free guidance, thanks to @xiankgx again! 2022-04-30 06:34:57 -07:00
Phil Wang
5063d192b6 now completely OpenAI CLIP compatible for training 2022-04-29 13:05:01 -07:00
just take care of the logic for AdamW and transformers
used namedtuples for clip adapter embedding outputs
Phil Wang
fb662a62f3 fix another bug thanks to @xiankgx 2022-04-29 07:38:32 -07:00
Phil Wang
587c8c9b44 optimize for clarity 2022-04-28 21:59:13 -07:00
Phil Wang
aa900213e7 force first unet in the cascade to be conditioned on image embeds 2022-04-28 20:53:15 -07:00
Phil Wang
625ce23f6b 🐛 2022-04-28 07:21:18 -07:00
Phil Wang
dbf4a281f1 make sure another CLIP can actually be passed in, as long as it is wrapped in an adapter extended from BaseClipAdapter 2022-04-27 20:45:27 -07:00
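A rough sketch of the adapter pattern this commit describes: any CLIP can be passed in as long as it is wrapped in a class exposing a common surface. The method names below are illustrative; the actual BaseClipAdapter interface in the repo may differ.

```python
from torch import nn

class BaseClipAdapter(nn.Module):
    # illustrative base class; the real interface in dalle2-pytorch may differ
    def __init__(self, clip):
        super().__init__()
        self.clip = clip

    def embed_text(self, text):
        raise NotImplementedError

    def embed_image(self, image):
        raise NotImplementedError

class MyClipAdapter(BaseClipAdapter):
    # wraps some external CLIP so the rest of the codebase only sees the adapter surface
    def embed_text(self, text):
        return self.clip.encode_text(text)

    def embed_image(self, image):
        return self.clip.encode_image(image)
```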
Phil Wang
4ab527e779 some extra asserts for text encoding of diffusion prior and decoder 2022-04-27 20:11:43 -07:00
Phil Wang
d0cdeb3247 add ability for DALL-E2 to return PIL images with return_pil_images = True on forward, for those who have no clue about deep learning 2022-04-27 19:58:06 -07:00
Phil Wang
8c610aad9a only pass text encodings conditioning in diffusion prior if specified on initialization 2022-04-27 19:48:16 -07:00
Phil Wang
6700381a37 prepare for ability to integrate other clips other than x-clip 2022-04-27 19:35:05 -07:00
Phil Wang
fa3bb6ba5c make sure cpu-only still works 2022-04-27 08:02:10 -07:00
Phil Wang
2705e7c9b0 attention-based upsampling claims unsupported by local experiments, removing 2022-04-27 07:51:04 -07:00
Phil Wang
de0296106b be able to turn off warning for use of LazyLinear by passing in text embedding dimension for unet 2022-04-26 11:42:46 -07:00
Phil Wang
eafb136214 suppress a warning 2022-04-26 11:40:45 -07:00
Phil Wang
bfbcc283a3 DRY a tiny bit for gaussian diffusion related logic 2022-04-26 11:39:12 -07:00
Phil Wang
c30544b73a no CLIP altogether for training DiffusionPrior 2022-04-26 10:23:41 -07:00
Phil Wang
9878be760b have researcher explicitly state upfront whether to condition with text encodings in cascading ddpm decoder, have DALLE-2 class take care of passing in text if feature turned on 2022-04-26 09:47:09 -07:00
Phil Wang
7ba6357c05 allow for training the Prior network with precomputed CLIP embeddings (or text encodings) 2022-04-26 09:29:51 -07:00
Phil Wang
76e063e8b7 refactor so that the causal transformer in the diffusion prior network can be conditioned without text encodings (for LAION's parallel efforts, although the paper suggests they are needed) 2022-04-26 09:00:11 -07:00
Phil Wang
4d25976f33 make sure non-latent diffusion still works 2022-04-26 08:36:00 -07:00
Phil Wang
0b28ee0d01 revert back to old upsampling, paper does not work 2022-04-26 07:39:04 -07:00
Phil Wang
f75d49c781 start a file for all attention-related modules, use attention-based upsampling in the unets in dalle-2 2022-04-25 18:59:10 -07:00
Phil Wang
8f2a0c7e00 better naming 2022-04-25 07:44:33 -07:00
Phil Wang
863f4ef243 just take care of the logic for setting all latent diffusion to predict x0, if needed 2022-04-24 10:06:42 -07:00
Phil Wang
fb8a66a2de just in case latent diffusion performs better with prediction of x0 instead of epsilon, open up the research avenue 2022-04-24 10:04:22 -07:00
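A minimal sketch of the two parameterizations this commit opens up: the network can be trained to predict the added noise (epsilon) or the clean signal x0 directly; the two are interchangeable through the closed-form forward process.

```python
import torch

def q_sample(x0, alpha_cumprod_t, noise):
    # forward diffusion: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise
    return alpha_cumprod_t.sqrt() * x0 + (1 - alpha_cumprod_t).sqrt() * noise

def predict_x0_from_eps(x_t, alpha_cumprod_t, eps):
    # invert the forward process given a predicted epsilon
    return (x_t - (1 - alpha_cumprod_t).sqrt() * eps) / alpha_cumprod_t.sqrt()

x0, noise = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
a_t = torch.tensor(0.7)
x_t = q_sample(x0, a_t, noise)
assert torch.allclose(predict_x0_from_eps(x_t, a_t, noise), x0, atol = 1e-5)
```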
Phil Wang
579d4b42dd does not seem right to clip for the prior diffusion part 2022-04-24 09:51:18 -07:00
Phil Wang
473808850a some outlines to the eventual CLI endpoint 2022-04-24 09:27:15 -07:00
Phil Wang
05b74be69a use null container pattern to cleanup some conditionals, save more cleanup for next week 2022-04-22 15:23:18 -07:00
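A minimal sketch of the null container (null object) pattern mentioned above: instead of guarding every call with `if module is not None`, a do-nothing stand-in is substituted so the call site stays unconditional; nn.Identity plays that role in PyTorch. The module below is a generic illustration, not the repo's code.

```python
from torch import nn

class MaybeProject(nn.Module):
    # the projection is only created when dimensions differ, otherwise Identity stands in,
    # so forward() needs no conditional
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out) if dim_in != dim_out else nn.Identity()

    def forward(self, x):
        return self.proj(x)
```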
Phil Wang
76b32f18b3 first pass at complete DALL-E2 + Latent Diffusion integration, latent diffusion on any layer(s) of the cascading ddpm in the decoder. 2022-04-22 13:53:13 -07:00
Phil Wang
46cef31c86 optional projection out for prior network causal transformer 2022-04-22 11:16:30 -07:00
Phil Wang
59b1a77d4d be a bit more conservative and stick with layernorm (without bias) for now, given @borisdayma's results https://twitter.com/borisdayma/status/1517227191477571585 2022-04-22 11:14:54 -07:00
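For reference, a layernorm with a learned gain but no bias term, in the spirit of what the commit keeps; this is a generic sketch, not necessarily the repo's exact module.

```python
import torch
from torch import nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    # layernorm with a learned gain (gamma) and no bias
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return F.layer_norm(x, x.shape[-1:], weight = self.gamma)
```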
Phil Wang
7f338319fd makes more sense for blur augmentation to happen before the upsampling 2022-04-22 11:10:47 -07:00
Phil Wang
2c6c91829d refactor blurring training augmentation to be taken care of by the decoder, with option to downsample to previous resolution before upsampling (cascading ddpm). this opens up the possibility of cascading latent ddpm 2022-04-22 11:09:17 -07:00
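A minimal sketch of the low-resolution conditioning augmentation described above, for a single cascade stage: downsample to the previous stage's resolution, blur, then upsample back. Kernel size, sigma, and resolutions here are illustrative, not the repo's defaults.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def lowres_conditioning(image, prev_res = 64, blur_kernel = 3, blur_sigma = 0.6):
    target_res = image.shape[-1]
    low = F.interpolate(image, prev_res, mode = 'bilinear', align_corners = False)   # down to previous stage
    low = GaussianBlur(blur_kernel, sigma = blur_sigma)(low)                         # blur before upsampling
    return F.interpolate(low, target_res, mode = 'bilinear', align_corners = False)  # back up for conditioning

cond = lowres_conditioning(torch.randn(1, 3, 256, 256))  # (1, 3, 256, 256), low-frequency version
```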
Phil Wang
ad17c69ab6 prepare for latent diffusion in the first DDPM of the cascade in the Decoder 2022-04-21 17:54:31 -07:00
Phil Wang
faebf4c8b8 from my vision transformer experience, dimension of attention head of 32 is sufficient for image feature maps 2022-04-20 11:40:32 -07:00
Phil Wang
f37c26e856 cleanup and DRY a little 2022-04-20 10:56:32 -07:00
Phil Wang
27a33e1b20 complete contextmanager method for keeping only one unet in GPU during training or inference 2022-04-20 10:46:13 -07:00
Phil Wang
6f941a219a give time tokens a surface area of 2 tokens as default, make it so researcher can customize which unet actually is conditioned on image embeddings and/or text encodings 2022-04-20 10:04:47 -07:00
Phil Wang
ddde8ca1bf fix cosine beta schedule, thanks to @Zhengxinyang 2022-04-19 20:54:28 -07:00
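For context, a minimal sketch of the cosine beta schedule from Nichol & Dhariwal (2021) that this commit refers to; the clipping value is the commonly used default.

```python
import math
import torch

def cosine_beta_schedule(timesteps, s = 0.008):
    # alphas_cumprod follows a squared cosine curve; betas are derived from its successive ratios
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps) / timesteps
    alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0, 0.999)

betas = cosine_beta_schedule(1000)  # shape (1000,), increasing noise schedule
```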