# Bumblebee.Diffusion.UNet2DConditional (Bumblebee v0.2.0)

U-Net model with two spatial dimensions and conditional state.

## Architectures

- `:base` - the U-Net model

## Inputs

- `"sample"` - `{batch_size, sample_size, sample_size, in_channels}`

  Sample input with two spatial dimensions.

- `"timestep"` - `{}`

  The timestep used to parameterize model behaviour in a multi-step process, such as diffusion.

- `"encoder_hidden_state"` - `{batch_size, sequence_length, hidden_size}`

  The conditional state (context) to use with cross-attention.
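
The shapes above can be written out concretely. The following is a minimal sketch in plain Elixir, not part of Bumblebee: the module `UNetInputShapesSketch` and its function are hypothetical, the fallback values `32` and `4` are the documented defaults of `:sample_size` and `:in_channels`, and the `sequence_length` and `hidden_size` values in the usage line are placeholders chosen only for illustration.

```elixir
defmodule UNetInputShapesSketch do
  # Hypothetical helper: returns the expected shape of each model input
  # as a tuple, following the layouts listed in this section.
  def input_shapes(opts) do
    batch_size = Keyword.fetch!(opts, :batch_size)
    # Fall back to the documented defaults of :sample_size and :in_channels.
    sample_size = Keyword.get(opts, :sample_size, 32)
    in_channels = Keyword.get(opts, :in_channels, 4)
    sequence_length = Keyword.fetch!(opts, :sequence_length)
    hidden_size = Keyword.fetch!(opts, :hidden_size)

    %{
      # Channels come last in the sample shape.
      "sample" => {batch_size, sample_size, sample_size, in_channels},
      # The timestep is a scalar, hence the empty shape.
      "timestep" => {},
      "encoder_hidden_state" => {batch_size, sequence_length, hidden_size}
    }
  end
end

# sequence_length and hidden_size are illustrative placeholders.
shapes =
  UNetInputShapesSketch.input_shapes(batch_size: 1, sequence_length: 77, hidden_size: 768)

IO.inspect(shapes)
```

Note that `"sample"` and `"encoder_hidden_state"` share the same leading `batch_size`, while `"timestep"` has no dimensions at all.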

## Configuration

- `:sample_size` - the size of the input spatial dimensions. Defaults to `32`

- `:in_channels` - the number of channels in the input. Defaults to `4`

- `:out_channels` - the number of channels in the output. Defaults to `4`

- `:center_input_sample` - whether to center the input sample. Defaults to `false`

- `:embedding_flip_sin_to_cos` - whether to flip the sin to cos in the sinusoidal timestep embedding. Defaults to `true`

- `:embedding_frequency_correction_term` - controls the frequency formula in the sinusoidal timestep embedding. The frequency is computed as $\omega_i = \frac{1}{10000^{\frac{i}{n - s}}}$, for $i \in \{0, \dots, n-1\}$, where $n$ is half of the embedding size and $s$ is the shift set by this option. Historically, certain implementations of sinusoidal embedding used $s=0$, while others used $s=1$. Defaults to `0`

- `:hidden_sizes` - the dimensionality of hidden layers in each upsample/downsample block. Defaults to `[320, 640, 1280, 1280]`

- `:depth` - the number of residual blocks in each upsample/downsample block. Defaults to `2`

- `:down_block_types` - a list of downsample block types. The supported blocks are: `:down_block`, `:cross_attention_down_block`. Defaults to `[:cross_attention_down_block, :cross_attention_down_block, :cross_attention_down_block, :down_block]`

- `:up_block_types` - a list of upsample block types. The supported blocks are: `:up_block`, `:cross_attention_up_block`. Defaults to `[:up_block, :cross_attention_up_block, :cross_attention_up_block, :cross_attention_up_block]`

- `:downsample_padding` - the padding to use in the downsample convolution. Defaults to `[{1, 1}, {1, 1}]`

- `:mid_block_scale_factor` - the scale factor to use for the mid block. Defaults to `1`

- `:num_attention_heads` - the number of attention heads for each attention layer. Can optionally be a list with one number per block. Defaults to `8`

- `:cross_attention_size` - the dimensionality of the cross-attention features. Defaults to `1280`

- `:use_linear_projection` - whether the input/output projection of the transformer block should be linear (`true`) or convolutional (`false`). Defaults to `false`

- `:activation` - the activation function. Defaults to `:silu`

- `:group_norm_num_groups` - the number of groups used by the group normalization layers. Defaults to `32`

- `:group_norm_epsilon` - the epsilon used by the group normalization layers. Defaults to `1.0e-5`
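
To make the effect of `:embedding_frequency_correction_term` more tangible, the frequency formula given for that option can be evaluated directly. The following is a minimal sketch in plain Elixir that only implements the formula as stated; the module name `FrequencySketch` is hypothetical and this is not how Bumblebee itself computes the embedding.

```elixir
defmodule FrequencySketch do
  # Evaluates omega_i = 1 / 10000^(i / (n - s)) for i in 0..n-1,
  # where n is half of the embedding size and s is the shift.
  # Assumes n - s is non-zero.
  def frequencies(embedding_size, shift) do
    n = div(embedding_size, 2)

    for i <- 0..(n - 1) do
      1.0 / :math.pow(10_000, i / (n - shift))
    end
  end
end

# An embedding of size 8 gives n = 4 frequencies.
IO.inspect(FrequencySketch.frequencies(8, 0), label: "s = 0")
IO.inspect(FrequencySketch.frequencies(8, 1), label: "s = 1")
```

In both cases the first frequency is 1. With $s=1$ the last exponent is $\frac{n-1}{n-1} = 1$, so the last frequency is $\frac{1}{10000}$ regardless of $n$. With $s=0$ the last exponent is $\frac{n-1}{n}$, so the sequence stops short of that value (at roughly $\frac{1}{1000}$ for $n=4$).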
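
Several of the options above are lists or per-block values that have to line up with each other. The sketch below, in plain Elixir, shows one way to sanity-check such a configuration. It rests on assumptions not stated in this page: that `:hidden_sizes`, `:down_block_types`, and a list-valued `:num_attention_heads` are matched by position, and that each hidden size should be divisible by `:group_norm_num_groups` (a general property of group normalization). The module `UNetConfigSketch` and its functions are hypothetical and not part of Bumblebee.

```elixir
defmodule UNetConfigSketch do
  # A subset of the documented defaults, copied from the option list.
  def defaults do
    [
      hidden_sizes: [320, 640, 1280, 1280],
      down_block_types: [
        :cross_attention_down_block,
        :cross_attention_down_block,
        :cross_attention_down_block,
        :down_block
      ],
      num_attention_heads: 8,
      group_norm_num_groups: 32
    ]
  end

  # A single number is repeated for every block.
  def heads_per_block(heads, num_blocks) when is_integer(heads),
    do: List.duplicate(heads, num_blocks)

  # A list is accepted only if it has one number per block.
  def heads_per_block(heads, num_blocks)
      when is_list(heads) and length(heads) == num_blocks,
      do: heads

  # Pairs each downsample block type with a hidden size and a head count
  # (positional pairing is an assumption of this sketch).
  def down_block_plan(config) do
    sizes = Keyword.fetch!(config, :hidden_sizes)
    types = Keyword.fetch!(config, :down_block_types)
    heads = heads_per_block(Keyword.fetch!(config, :num_attention_heads), length(types))

    Enum.zip([types, sizes, heads])
  end

  # Group normalization splits channels into equally sized groups, so every
  # hidden size is checked for divisibility by the number of groups.
  def group_norm_compatible?(config) do
    groups = Keyword.fetch!(config, :group_norm_num_groups)
    Enum.all?(Keyword.fetch!(config, :hidden_sizes), &(rem(&1, groups) == 0))
  end
end

config = UNetConfigSketch.defaults()
IO.inspect(UNetConfigSketch.down_block_plan(config), label: "down blocks")
IO.inspect(UNetConfigSketch.group_norm_compatible?(config), label: "group norm ok?")
```

A check like this is most useful when overriding `:hidden_sizes` or the block type lists, since a mismatch in list lengths would otherwise only surface when the model is built.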