Bumblebee.Diffusion.UNet2DConditional (Bumblebee v0.6.0)
U-Net model with two spatial dimensions and conditioning state.
Architectures
:base
- the U-Net model
Inputs
"sample"
- {batch_size, sample_size, sample_size, in_channels}
Sample input with two spatial dimensions.
"timestep"
- {}
The timestep used to parameterize model behaviour in a multi-step process, such as diffusion.
"encoder_hidden_state"
- {batch_size, sequence_length, hidden_size}
The conditioning state (context) to use with cross-attention.
"additional_down_block_states"
Optional outputs matching the structure of down blocks, added as part of the encoder-decoder skip connections.
"additional_mid_block_state"
Optional output added to the mid block result.
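As a rough illustration of the shapes listed above, the following sketch builds a map of the three required inputs with Nx. It assumes a batch size of 1, a sequence length of 77 (a hypothetical value, chosen only for illustration), and that the last dimension of "encoder_hidden_state" matches the :cross_attention_size option described under Configuration. The two optional inputs are left out. The Mix.install/1 call is there only to make the snippet standalone.

```elixir
# Standalone setup; inside a Mix project, list :bumblebee in deps instead.
Mix.install([{:bumblebee, "~> 0.6.0"}])

# Build a spec for the only architecture this module lists;
# all other options keep their defaults.
spec = Bumblebee.configure(Bumblebee.Diffusion.UNet2DConditional, architecture: :base)

batch_size = 1
# Assumption: sequence length of the conditioning state, picked for illustration.
sequence_length = 77

inputs = %{
  # {batch_size, sample_size, sample_size, in_channels}
  "sample" =>
    Nx.broadcast(0.0, {batch_size, spec.sample_size, spec.sample_size, spec.in_channels}),
  # {} is a scalar tensor
  "timestep" => Nx.tensor(10),
  # {batch_size, sequence_length, hidden_size}
  # Assumption: hidden_size equals spec.cross_attention_size.
  "encoder_hidden_state" =>
    Nx.broadcast(0.0, {batch_size, sequence_length, spec.cross_attention_size})
}

# Print the shape of every input for a quick sanity check.
for {name, tensor} <- inputs do
  IO.inspect(Nx.shape(tensor), label: name)
end
```

The zero-filled tensors are placeholders; in a real diffusion loop the sample would be a noisy latent and the conditioning state would come from a text encoder.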
Configuration
:sample_size
- the size of the input spatial dimensions. Defaults to 32
:in_channels
- the number of channels in the input. Defaults to 4
:out_channels
- the number of channels in the output. Defaults to 4
:center_input_sample
- whether to center the input sample. Defaults to false
:embedding_flip_sin_to_cos
- whether to flip the sin to cos in the sinusoidal timestep embedding. Defaults to true
:embedding_frequency_correction_term
- controls the frequency formula in the sinusoidal timestep embedding. The frequency is computed as $\omega_i = \frac{1}{10000^{\frac{i}{n - s}}}$, for $i \in \{0, \dots, n-1\}$, where $n$ is half of the embedding size and $s$ is the shift set by this option. Historically, certain implementations of sinusoidal embedding used $s=0$, while others used $s=1$. Defaults to 0
:hidden_sizes
- the dimensionality of hidden layers in each upsample/downsample block. Defaults to [320, 640, 1280, 1280]
:depth
- the number of residual blocks in each upsample/downsample block. Defaults to 2
:down_block_types
- a list of downsample block types. The supported blocks are :down_block and :cross_attention_down_block. Defaults to [:cross_attention_down_block, :cross_attention_down_block, :cross_attention_down_block, :down_block]
:up_block_types
- a list of upsample block types. The supported blocks are :up_block and :cross_attention_up_block. Defaults to [:up_block, :cross_attention_up_block, :cross_attention_up_block, :cross_attention_up_block]
:downsample_padding
- the padding to use in the downsample convolution. Defaults to [{1, 1}, {1, 1}]
:mid_block_scale_factor
- the scale factor to use for the mid block. Defaults to 1
:num_attention_heads
- the number of attention heads for each attention layer. Can optionally be a list with one number per block. Defaults to 8
:cross_attention_size
- the dimensionality of the cross attention features. Defaults to 1280
:use_linear_projection
- whether the input/output projection of the transformer block should be linear or convolutional. Defaults to false
:activation
- the activation function. Defaults to :silu
:group_norm_num_groups
- the number of groups used by the group normalization layers. Defaults to 32
:group_norm_epsilon
- the epsilon used by the group normalization layers. Defaults to 1.0e-5
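The frequency formula given for :embedding_frequency_correction_term can be checked with a few lines of plain Elixir. This is a minimal sketch of the formula only, not Bumblebee's implementation. The module name TimestepFrequencies and the function frequencies/2 are hypothetical, and the embedding size of 8 is chosen just to keep the numbers readable.

```elixir
defmodule TimestepFrequencies do
  # Hypothetical helper, written only to illustrate the formula.
  #
  # Returns [w_0, ..., w_(n-1)] with w_i = 1 / 10000^(i / (n - s)),
  # where n is half of the embedding size and s is the correction term.
  # The guard keeps n >= 2, so n - s is non-zero for s = 0 and s = 1.
  def frequencies(embedding_size, correction_term)
      when is_integer(embedding_size) and embedding_size >= 4 do
    n = div(embedding_size, 2)

    for i <- 0..(n - 1) do
      1.0 / :math.pow(10_000, i / (n - correction_term))
    end
  end
end

# Embedding size 8 gives n = 4.
shift_0 = TimestepFrequencies.frequencies(8, 0)
shift_1 = TimestepFrequencies.frequencies(8, 1)

IO.inspect(shift_0, label: "s = 0")
IO.inspect(shift_1, label: "s = 1")
```

With n = 4, setting s = 0 yields the exponents 0, 1/4, 1/2, 3/4, so the smallest frequency is about 1.0e-3. Setting s = 1 yields 0, 1/3, 2/3, 1, so the smallest frequency reaches 1.0e-4. The correction term therefore decides whether the last frequency lands exactly on 1/10000 or stops one step short of it.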