Bumblebee.Diffusion.UNet2DConditional (Bumblebee v0.5.3)

U-Net model with two spatial dimensions and conditional state.

Architectures

  • :base - the U-Net model

Inputs

  • "sample" - {batch_size, sample_size, sample_size, in_channels}

    Sample input with two spatial dimensions.

  • "timestep" - {}

    The timestep used to parameterize model behaviour in a multi-step process, such as diffusion.

  • "encoder_hidden_state" - {batch_size, sequence_length, hidden_size}

    The conditional state (context) to use with cross-attention.
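To make the expected shapes concrete, here is a small Python sketch that builds the shape of each named input. The helper name `unet_input_shapes` and the example batch size and sequence length are purely illustrative. The defaults of 32 and 4 mirror the documented `:sample_size` and `:in_channels` defaults. Using 1280 for `hidden_size` assumes it matches the documented `:cross_attention_size` default, which this page does not state explicitly.

```python
def unet_input_shapes(batch_size, sequence_length, sample_size=32,
                      in_channels=4, hidden_size=1280):
    """Return the expected shape of each named model input as a tuple.

    hidden_size=1280 is an assumption: it is taken to equal the
    :cross_attention_size default.
    """
    return {
        # Channels-last layout: {batch, height, width, channels}
        "sample": (batch_size, sample_size, sample_size, in_channels),
        # The timestep is a scalar, so its shape is empty
        "timestep": (),
        "encoder_hidden_state": (batch_size, sequence_length, hidden_size),
    }


shapes = unet_input_shapes(batch_size=2, sequence_length=77)
for name, shape in shapes.items():
    print(name, shape)
```

Note that the timestep is a single scalar, while the other two inputs share the same leading batch dimension.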

Configuration

  • :sample_size - the size of the input spatial dimensions. Defaults to 32

  • :in_channels - the number of channels in the input. Defaults to 4

  • :out_channels - the number of channels in the output. Defaults to 4

  • :center_input_sample - whether to center the input sample. Defaults to false

  • :embedding_flip_sin_to_cos - whether to swap the order of the sine and cosine components in the sinusoidal timestep embedding. Defaults to true

  • :embedding_frequency_correction_term - controls the frequency formula in the sinusoidal timestep embedding. The frequency is computed as $\omega_i = \frac{1}{10000^{\frac{i}{n - s}}}$, for $i \in \{0, \dots, n-1\}$, where $n$ is half of the embedding size and $s$ is the shift set by this option. Historically, some implementations of the sinusoidal embedding used $s=0$, while others used $s=1$. Defaults to 0

  • :hidden_sizes - the dimensionality of hidden layers in each upsample/downsample block. Defaults to [320, 640, 1280, 1280]

  • :depth - the number of residual blocks in each upsample/downsample block. Defaults to 2

  • :down_block_types - a list of downsample block types. The supported blocks are: :down_block, :cross_attention_down_block. Defaults to [:cross_attention_down_block, :cross_attention_down_block, :cross_attention_down_block, :down_block]

  • :up_block_types - a list of upsample block types. The supported blocks are: :up_block, :cross_attention_up_block. Defaults to [:up_block, :cross_attention_up_block, :cross_attention_up_block, :cross_attention_up_block]

  • :downsample_padding - the padding to use in the downsample convolution. Defaults to [{1, 1}, {1, 1}]

  • :mid_block_scale_factor - the scale factor to use for the mid block. Defaults to 1

  • :num_attention_heads - the number of attention heads for each attention layer. Can optionally be a list with one number per block. Defaults to 8

  • :cross_attention_size - the dimensionality of the cross attention features. Defaults to 1280

  • :use_linear_projection - whether the input/output projection of the transformer block should be linear or convolutional. Defaults to false

  • :activation - the activation function. Defaults to :silu

  • :group_norm_num_groups - the number of groups used by the group normalization layers. Defaults to 32

  • :group_norm_epsilon - the epsilon used by the group normalization layers. Defaults to 1.0e-5
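
The frequency formula given for `:embedding_frequency_correction_term` can be checked numerically. The following is a minimal Python sketch of that formula, not Bumblebee's implementation. The function names are hypothetical, and two things are assumptions: that the option value plays the role of the shift $s$, and that `:embedding_flip_sin_to_cos` set to true places the cosine components first.

```python
import math


def embedding_frequencies(embedding_size, correction_term=0):
    """Compute omega_i = 1 / 10000^(i / (n - s)) for i in 0..n-1.

    n is half of the embedding size and s is the correction term.
    Requires n - s > 0.
    """
    n = embedding_size // 2
    return [1.0 / 10000 ** (i / (n - correction_term)) for i in range(n)]


def sinusoidal_embedding(timestep, embedding_size, correction_term=0,
                         flip_sin_to_cos=True):
    """Build a sinusoidal embedding for a scalar timestep.

    Assumption: flip_sin_to_cos=True means cosine components come first.
    """
    freqs = embedding_frequencies(embedding_size, correction_term)
    angles = [timestep * w for w in freqs]
    sin = [math.sin(a) for a in angles]
    cos = [math.cos(a) for a in angles]
    return cos + sin if flip_sin_to_cos else sin + cos


# With s = 0 the lowest frequency is 10000^(-(n-1)/n);
# with s = 1 it reaches exactly 1/10000.
print(embedding_frequencies(8, correction_term=0))
print(embedding_frequencies(8, correction_term=1))
```

The two settings differ only in how far the frequencies extend: with $s=1$ the last index reaches $1/10000$, while with $s=0$ the frequencies stop just short of it.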