Phil Wang | 6fee4fce6e | also allow for an image embedding to be passed into the diffusion model, in case one wants to generate the image embeddings once and then train multiple unets in one iteration | 2022-04-18 14:00:38 -07:00
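A minimal sketch of the idea above, with generic stand-ins rather than this repo's actual API: embed the images once with a frozen encoder, then reuse that single embedding to train several unets in the same iteration.

```python
import torch
from torch import nn

# Sketch only: generic stand-ins for this repo's CLIP adapter and unets.
embedder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512)).eval()
unets = nn.ModuleList([nn.Linear(512, 512) for _ in range(2)])

images = torch.randn(4, 3, 64, 64)

with torch.no_grad():                        # embedding computed a single time
    image_embed = embedder(images)

for unet in unets:                           # same embedding conditions each unet
    loss = unet(image_embed).pow(2).mean()   # stand-in for the diffusion loss
    loss.backward()
```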
Phil Wang | 960a79857b | use some magic just this once to remove the need for researchers to think | 2022-04-18 12:40:43 -07:00
Phil Wang | 00ae50999b | make kernel size and sigma of the gaussian blur for cascading DDPM overridable at forward. also make sure unets are wrapped in a ModuleList so that blurring does not happen at sample time | 2022-04-18 12:04:31 -07:00
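A sketch of the override pattern described here, with torchvision's gaussian_blur standing in for whatever blur the repo uses: defaults chosen at init can be overridden per forward call. Keeping the unets in an nn.ModuleList, rather than baking the blur into the forward graph, is presumably what lets sampling skip the blur entirely.

```python
import torch
from torch import nn
from torchvision.transforms import functional as TF

class Blur(nn.Module):
    # defaults picked at init, overridable per forward call
    def __init__(self, kernel_size = 3, sigma = 0.6):
        super().__init__()
        self.kernel_size = kernel_size
        self.sigma = sigma

    def forward(self, x, kernel_size = None, sigma = None):
        kernel_size = kernel_size if kernel_size is not None else self.kernel_size
        sigma = sigma if sigma is not None else self.sigma
        return TF.gaussian_blur(x, [kernel_size] * 2, [sigma] * 2)

blur = Blur()
x = torch.randn(2, 3, 64, 64)
y = blur(x)                                # init-time defaults
z = blur(x, kernel_size = 5, sigma = 1.)   # overridden at forward time
```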
Phil Wang | 0332eaa6ff | complete first pass at full cascading DDPM setup in Decoder, flexible enough to support one unet for testing | 2022-04-18 11:44:56 -07:00
Kashif Rasul | b0f2fbaa95 | pass the beta schedule to the Prior | 2022-04-17 15:21:47 +02:00
Kashif Rasul | 51361c2d15 | added beta_schedule argument | 2022-04-17 15:19:33 +02:00
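The beta_schedule argument presumably selects among standard DDPM noise schedules; a sketch of two common ones (linear from Ho et al. 2020, cosine from Nichol & Dhariwal 2021):

```python
import math
import torch

def linear_beta_schedule(timesteps):
    # standard DDPM linear schedule (Ho et al. 2020)
    return torch.linspace(1e-4, 0.02, timesteps)

def cosine_beta_schedule(timesteps, s = 0.008):
    # cosine schedule (Nichol & Dhariwal 2021): derive betas from a cosine
    # curve on the cumulative product of alphas
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps) / timesteps
    alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return betas.clamp(0, 0.999)
```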
Kashif Rasul | 42d6e47387 | added huber loss and other schedulers | 2022-04-17 15:14:05 +02:00
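For reference, Huber loss is available in PyTorch as smooth L1; it behaves like L2 near zero error and like L1 for large errors, which makes training less sensitive to outliers than plain MSE:

```python
import torch
import torch.nn.functional as F

pred   = torch.randn(8, 512)
target = torch.randn(8, 512)

# Huber (smooth L1) loss: quadratic for small residuals, linear for large ones
loss = F.smooth_l1_loss(pred, target)
```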
Phil Wang | c400d8758c | prepare for cascading diffusion in unet, save the full progressive upsampling architecture to be built next week | 2022-04-15 07:03:28 -07:00
Phil Wang | bece206699 | fix bug thanks to @jihoonerd | 2022-04-15 06:44:40 -07:00
Phil Wang | 6e27f617f1 | use T5 relative positional bias in the prior network's causal transformer, since it makes more sense than rotary embeddings | 2022-04-14 12:01:09 -07:00
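A sketch of a T5-style relative position bias for a causal transformer (bucketing follows the T5 paper; the hyperparameters here are assumptions): the offset between each query and key position is bucketed, exactly for small offsets and logarithmically for large ones, and each bucket maps to a learned per-head scalar added to the attention logits.

```python
import math
import torch
from torch import nn

class RelPosBias(nn.Module):
    def __init__(self, heads = 8, num_buckets = 32, max_distance = 128):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.bias = nn.Embedding(num_buckets, heads)

    def _bucket(self, rel_pos):
        n = (-rel_pos).clamp(min = 0)        # causal: keys never lie in the future
        max_exact = self.num_buckets // 2
        is_small = n < max_exact
        # log-spaced buckets for distances beyond max_exact
        large = max_exact + (
            torch.log(n.float() / max_exact)
            / math.log(self.max_distance / max_exact)
            * (self.num_buckets - max_exact)
        ).long()
        large = large.clamp(max = self.num_buckets - 1)
        return torch.where(is_small, n, large)

    def forward(self, n, device):
        pos = torch.arange(n, device = device)
        rel_pos = pos[None, :] - pos[:, None]        # (query, key) offsets
        values = self.bias(self._bucket(rel_pos))    # (n, n, heads)
        return values.permute(2, 0, 1).unsqueeze(0)  # (1, heads, n, n)

bias = RelPosBias(heads = 8)
attn_bias = bias(16, torch.device('cpu'))   # added to (b, heads, 16, 16) logits
```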
Phil Wang | 9f55c24db6 | allow for decoder conditioning on the text encodings from CLIP, if they are passed in. use LazyLinear to avoid researchers having to worry about text encoding dimensions, but remove it later if it does not work well | 2022-04-14 11:46:45 -07:00
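nn.LazyLinear infers its input dimension on the first forward pass, which is how the text-encoding dimension can go unspecified; a quick illustration:

```python
import torch
from torch import nn

# the input feature dimension is left unspecified and inferred at first use
to_text_cond = nn.LazyLinear(512)

text_encodings = torch.randn(4, 256, 768)   # any last dimension works
cond = to_text_cond(text_encodings)         # in-features inferred as 768 here
print(cond.shape)                           # torch.Size([4, 256, 512])
```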
Phil Wang | 23c401a5d5 | use the eval decorator | 2022-04-14 10:13:43 -07:00
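A sketch of the eval-decorator pattern (a common helper in this author's repos; the exact implementation here is assumed): flip the module into eval mode for the duration of a call such as sampling, then restore whichever mode it was in before.

```python
import torch
from torch import nn

def eval_decorator(fn):
    # run fn with the model in eval mode, then restore the previous mode
    def inner(model, *args, **kwargs):
        was_training = model.training
        model.eval()
        out = fn(model, *args, **kwargs)
        model.train(was_training)
        return out
    return inner

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(8, 8)

    @eval_decorator
    @torch.no_grad()
    def sample(self, x):
        return self.net(x)

model = Toy()
model.train()
model.sample(torch.randn(2, 8))   # executes in eval mode
assert model.training             # training mode restored afterwards
```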
Phil Wang | 68e9883f59 | use cross attention for conditioning the unet on image embedding tokens (which opens the door to conditioning on text encodings as well) | 2022-04-14 10:10:04 -07:00
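A self-contained sketch of the cross-attention conditioning: the unet's flattened feature-map tokens attend over a small set of conditioning tokens (standing in for image embedding tokens; text encoding tokens could be concatenated alongside). Dimensions are illustrative.

```python
import torch
from torch import nn

class CrossAttention(nn.Module):
    def __init__(self, dim, cond_dim, heads = 8, dim_head = 64):
        super().__init__()
        inner = heads * dim_head
        self.heads, self.scale = heads, dim_head ** -0.5
        self.to_q = nn.Linear(dim, inner, bias = False)       # queries from unet tokens
        self.to_kv = nn.Linear(cond_dim, inner * 2, bias = False)  # keys/values from conditioning
        self.to_out = nn.Linear(inner, dim, bias = False)

    def forward(self, x, cond):
        b, n, _ = x.shape
        m, h = cond.shape[1], self.heads
        q = self.to_q(x).view(b, n, h, -1).transpose(1, 2)
        k, v = self.to_kv(cond).chunk(2, dim = -1)
        k = k.view(b, m, h, -1).transpose(1, 2)
        v = v.view(b, m, h, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim = -1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)

attn = CrossAttention(dim = 128, cond_dim = 512)
x = torch.randn(2, 64, 128)     # flattened unet feature-map tokens
cond = torch.randn(2, 4, 512)   # image embedding tokens
out = attn(x, cond)
```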
Phil Wang | 95b018374a | start using SwiGLU everywhere, given the success of PaLM | 2022-04-14 09:34:32 -07:00
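SwiGLU gates half of a widened projection with SiLU (swish), as in PaLM's feedforward blocks; a minimal version:

```python
import torch
from torch import nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # half the projection acts as a gate passed through SiLU (swish)
    def forward(self, x):
        x, gate = x.chunk(2, dim = -1)
        return x * F.silu(gate)

def FeedForward(dim, mult = 4):
    inner = dim * mult
    return nn.Sequential(
        nn.Linear(dim, inner * 2, bias = False),  # doubled width for the gate
        SwiGLU(),
        nn.Linear(inner, dim, bias = False),
    )

ff = FeedForward(512)
out = ff(torch.randn(2, 16, 512))
```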
Phil Wang | 8b5c2385b0 | better naming | 2022-04-14 09:24:31 -07:00
Phil Wang | f2c52d8239 | fix bug with classifier free guidance for prior network, even though it seems it may not be used | 2022-04-14 09:21:51 -07:00
Phil Wang | 97e951221b | bring in blur, as it will be used somewhere in the cascading DDPM in the decoder eventually, once i figure it out | 2022-04-14 09:16:09 -07:00
Phil Wang | 82464d7bd3 | per-fect | 2022-04-14 08:30:07 -07:00
Phil Wang | 7fb3f695d5 | offer continuously parameterized time embedding for the diffusion prior network, removing a hyperparameter that may trip people up if not set correctly | 2022-04-14 08:28:11 -07:00
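One common continuous parameterization of time is sinusoidal features of the raw timestep, which removes the need to fix a num_timesteps-style hyperparameter up front; whether this matches the commit exactly is an assumption.

```python
import math
import torch
from torch import nn

class SinusoidalTimeEmb(nn.Module):
    # embeds a scalar timestep as a bank of sines and cosines at
    # geometrically spaced frequencies
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        half = self.dim // 2
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half, device = t.device) / (half - 1)
        )
        args = t[:, None].float() * freqs[None, :]
        return torch.cat((args.sin(), args.cos()), dim = -1)

emb = SinusoidalTimeEmb(64)
print(emb(torch.arange(4)).shape)   # torch.Size([4, 64])
```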
Phil Wang | 7e93b9d3c8 | make sure classifier free guidance condition scaling is exposed on DALLE2 forward function | 2022-04-13 20:14:28 -07:00
Phil Wang | 14ddbc159c | cleanup | 2022-04-13 18:24:32 -07:00
Phil Wang | 5e06cde4cb | always work in the l2-normed space for image and text embeddings | 2022-04-13 18:08:42 -07:00
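Working in the l2-normed space just means projecting embeddings onto the unit hypersphere, so dot products become cosine similarities:

```python
import torch
import torch.nn.functional as F

def l2norm(t):
    # project embeddings onto the unit hypersphere
    return F.normalize(t, dim = -1)

image_embed = l2norm(torch.randn(4, 512))
text_embed = l2norm(torch.randn(4, 512))
similarity = (image_embed * text_embed).sum(dim = -1)   # cosine similarity
```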
Phil Wang | a1a8a78f21 | fix everything and make sure it runs end to end, document everything in readme for public | 2022-04-13 18:05:25 -07:00
Phil Wang | e5e415297c | prepare non-causal attention, for use in the unet in the decoder | 2022-04-13 12:04:09 -07:00
Phil Wang | c9377efc93 | go for the multi-headed queries, one-headed key/values, proven out in AlphaCode as well as PaLM by now | 2022-04-13 12:01:43 -07:00
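A sketch of the multi-query attention described here: many query heads share a single key/value head, which shrinks the key/value projections and the decoding-time cache.

```python
import torch
from torch import nn

class MultiQueryAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        inner = heads * dim_head
        self.heads, self.scale = heads, dim_head ** -0.5
        self.to_q = nn.Linear(dim, inner, bias = False)
        self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)  # one kv head only
        self.to_out = nn.Linear(inner, dim, bias = False)

    def forward(self, x):
        b, n, _ = x.shape
        q = self.to_q(x).view(b, n, self.heads, -1).transpose(1, 2)  # (b, h, n, d)
        k, v = self.to_kv(x).chunk(2, dim = -1)                      # (b, n, d) each
        sim = torch.einsum('b h i d, b j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim = -1)
        out = torch.einsum('b h i j, b j d -> b h i d', attn, v)
        return self.to_out(out.transpose(1, 2).reshape(b, n, -1))

attn = MultiQueryAttention(dim = 512)
out = attn(torch.randn(2, 16, 512))
```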
Phil Wang | d3cded3c6c | complete logic in diffusion prior for sampling more than one image embedding, taking the top similarity | 2022-04-13 10:52:31 -07:00
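A sketch of the rerank step (the sampling call is a stand-in, not the repo's API): draw several candidate image embeddings per text, score each against the text embedding, and keep the most similar.

```python
import torch
import torch.nn.functional as F

num_samples, batch, dim = 2, 4, 512
text_embed = F.normalize(torch.randn(batch, dim), dim = -1)

# stand-in for running the prior's sampling loop num_samples times
candidates = F.normalize(torch.randn(num_samples, batch, dim), dim = -1)

sims = torch.einsum('n b d, b d -> n b', candidates, text_embed)  # cosine similarity
best = sims.argmax(dim = 0)                                       # (batch,)
image_embed = candidates[best, torch.arange(batch)]               # (batch, dim)
```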
Phil Wang | d573c82f8c | add one full attention at the middle of the unet, prepare to do efficient attention employing every trick i know from vision transformer literature | 2022-04-13 10:39:06 -07:00
Phil Wang | 3aa6f91e7a | be transparent | 2022-04-13 10:32:11 -07:00
Phil Wang | 1bf071af78 | allow for predicting image embedding directly during diffusion training. need to fix sampling still | 2022-04-13 10:29:29 -07:00
Phil Wang | 791d27326a | add diffusion code for the image embedding. nearly all the code is there except for the cascading ddpm in the decoder (with upscaling etc) | 2022-04-13 10:06:52 -07:00
Phil Wang | 33d69d3859 | take care of DDPM decoder (DDPM for producing image embedding will have a separate objective, predicting directly the embedding rather than the noise [epsilon in paper]) | 2022-04-12 17:48:41 -07:00
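The two objectives side by side, as a sketch: the decoder's DDPM regresses the added noise (epsilon), while the prior's DDPM regresses the clean target (here, the image embedding) directly.

```python
import torch
import torch.nn.functional as F

def p_loss(model_out, noise, x_start, predict_x_start):
    # predict_x_start = True: regress the clean target directly (prior)
    # predict_x_start = False: regress the noise epsilon (decoder)
    target = x_start if predict_x_start else noise
    return F.mse_loss(model_out, target)

model_out = torch.randn(4, 512)
noise, x_start = torch.randn(4, 512), torch.randn(4, 512)
prior_loss = p_loss(model_out, noise, x_start, predict_x_start = True)
decoder_loss = p_loss(model_out, noise, x_start, predict_x_start = False)
```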
Phil Wang | 862e5ba50e | more sketches to base dalle2 class | 2022-04-12 17:31:01 -07:00
Phil Wang | 25d980ebbf | complete naive conditioning of unet with image embedding, with the ability to drop it out for classifier free guidance | 2022-04-12 17:27:39 -07:00
Phil Wang | d546a615c0 | complete helper methods for doing condition scaling (classifier free guidance), for decoder unet and prior network | 2022-04-12 16:11:16 -07:00
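A sketch of the condition scaling helper for classifier free guidance (the cond_drop_prob keyword is an assumed interface, not the repo's verbatim signature): run the network once conditioned and once unconditioned, then extrapolate between the two predictions by cond_scale.

```python
import torch

def forward_with_cond_scale(net, *args, cond_scale = 1., **kwargs):
    logits = net(*args, cond_drop_prob = 0., **kwargs)       # fully conditioned
    if cond_scale == 1:
        return logits
    null_logits = net(*args, cond_drop_prob = 1., **kwargs)  # condition dropped
    return null_logits + (logits - null_logits) * cond_scale

# toy stand-in network, just to show the extrapolation
toy = lambda x, cond_drop_prob: x * (1. - cond_drop_prob)
out = forward_with_cond_scale(toy, torch.randn(4, 512), cond_scale = 3.)
```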
Phil Wang | d4c8373635 | complete conditional dropout mask creation for both the prior network and the image decoder unet, for classifier free guidance | 2022-04-12 14:04:08 -07:00
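A sketch of conditional dropout mask creation: with probability cond_drop_prob, a batch element's conditioning is swapped for a null embedding, which is what later enables classifier free guidance at sampling time.

```python
import torch

def prob_mask_like(shape, prob, device = None):
    # True with probability `prob`, independently per element
    return torch.zeros(shape, device = device).float().uniform_(0, 1) < prob

batch, dim = 4, 512
image_embed = torch.randn(batch, dim)
null_embed = torch.zeros(batch, dim)   # a learned null embedding in practice

cond_drop_prob = 0.2
keep_mask = prob_mask_like((batch,), 1 - cond_drop_prob)
image_embed = torch.where(keep_mask[:, None], image_embed, null_embed)
```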
Phil Wang | 74aec9d8ca | further prepare attention for classifier free guidance | 2022-04-12 13:01:18 -07:00
Phil Wang | 7647be2569 | prep for classifier free guidance for the image embedding diffusion step, even though not mentioned in paper | 2022-04-12 12:57:09 -07:00
Phil Wang | 59b8abe09e | prepare unet to be conditioned on image embedding, optionally text encodings, and reminder for self to build conditional dropout for classifier free guidance | 2022-04-12 12:38:56 -07:00
Phil Wang | 40aa304b7e | rename to DiffusionPriorNetwork in case ARPriorNetwork is ever built | 2022-04-12 11:45:57 -07:00
Phil Wang | fd38eb83c4 | complete the main contribution of the paper, the diffusion prior network, minus the diffusion training setup | 2022-04-12 11:43:59 -07:00
Phil Wang | 83aabd42ca | move epsilon inside the square root for further stability in RMSNorm; improvise and use RMSNorm in convnext blocks too | 2022-04-12 11:18:36 -07:00
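A sketch of a bias-less RMSNorm with the epsilon inside the square root, i.e. rsqrt(mean-square + eps), matching the stability fix described:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    # bias-less RMS normalization; epsilon sits inside the square root
    def __init__(self, dim, eps = 1e-8):
        super().__init__()
        self.eps = eps
        self.g = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        mean_sq = x.pow(2).mean(dim = -1, keepdim = True)
        return x * torch.rsqrt(mean_sq + self.eps) * self.g

norm = RMSNorm(512)
out = norm(torch.randn(2, 16, 512))
```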
Phil Wang | cf22affcbb | bring in modified unet using convnext blocks https://arxiv.org/abs/2201.03545 | 2022-04-12 10:58:44 -07:00
Phil Wang | 522f42f582 | start using RMSNorm, used in Gopher and AlphaCode, and as a way to go completely bias-less (purportedly more stable, according to PaLM) | 2022-04-12 10:45:03 -07:00
Phil Wang | 0a60818965 | dropouts in transformer, also prep for classifier free guidance in decoder | 2022-04-12 10:42:57 -07:00
Phil Wang | 771fe0d0d2 | also consider accepting a tokenizer, so the dalle2 forward pass can just be invoked as DALLE2(<prompt string>) | 2022-04-12 10:29:29 -07:00
Phil Wang | df4dac4f5a | bring in attention - it is all we need | 2022-04-12 10:23:07 -07:00
Phil Wang | 24b428bdfc | readme | 2022-04-12 10:12:42 -07:00
Phil Wang | 62c0d321a6 | sketch | 2022-04-12 09:39:42 -07:00
Phil Wang | 7cf1637d24 | bring in the simple tokenizer released by OpenAI, but also plan on leaving room for a custom tokenizer with yttm | 2022-04-12 09:23:17 -07:00
Phil Wang | 4ff6d021c9 | pin to newer version of CLIP that returns encoded text and images, get some helper functions ready for XCLIP | 2022-04-12 08:54:47 -07:00