Bumblebee.Multimodal.Clip (Bumblebee v0.6.0)

The CLIP model for text-image similarity.

Architectures

  • :base - the base CLIP model

Inputs

  • "input_ids" - {batch_size, sequence_length}

    Indices of input sequence tokens in the vocabulary.

  • "attention_mask" - {batch_size, sequence_length}

    Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.

  • "position_ids" - {batch_size, sequence_length}

    Indices of positions of each input sequence token in the position embeddings.

  • "pixel_values" - {batch_size, image_size, image_size, num_channels}

    Featurized image pixel values.

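For illustration, a minimal sketch of building these inputs and running the :base model. It assumes the openai/clip-vit-base-patch32 checkpoint, the stb_image package for image loading, and that the base architecture exposes similarity logits as logits_per_image / logits_per_text:

    {:ok, %{model: model, params: params}} =
      Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"})

    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})
    {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

    # "input_ids" and "attention_mask" come from the tokenizer
    text_inputs =
      Bumblebee.apply_tokenizer(tokenizer, ["a photo of a cat", "a photo of a dog"])

    # "pixel_values" comes from the featurizer
    image = StbImage.read_file!("cat.jpg")
    image_inputs = Bumblebee.apply_featurizer(featurizer, [image])

    inputs = Map.merge(text_inputs, image_inputs)

    outputs = Axon.predict(model, params, inputs)
    # outputs.logits_per_image holds image-to-text similarity scores
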
Global layer options

  • :output_hidden_states - when true, the model output includes all hidden states

  • :output_attentions - when true, the model output includes all attention weights

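A minimal sketch of enabling these options, assuming Axon's :global_layer_options compile option (available in recent Axon releases) and reusing the model, params, and inputs from the example above:

    {_init_fn, predict_fn} =
      Axon.build(model,
        global_layer_options: [output_hidden_states: true, output_attentions: true]
      )

    outputs = predict_fn.(params, inputs)
    # the output map now additionally includes per-layer hidden states and attention weights
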
Configuration

  • :text_spec - the specification of the text model. See Bumblebee.Text.ClipText for details

  • :vision_spec - the specification of the vision model. See Bumblebee.Vision.ClipVision for details

  • :projection_size - the dimensionality of text and vision projection layers. Defaults to 512

  • :logit_scale_initial_value - the initial value for the scaling layer used to scale similarity logits. Defaults to 2.6592

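For illustration, a spec can be fetched, reconfigured, and passed back when loading the model; the checkpoint name and option values below are only examples:

    {:ok, spec} = Bumblebee.load_spec({:hf, "openai/clip-vit-base-patch32"})

    # override configuration options (illustrative values)
    spec = Bumblebee.configure(spec, projection_size: 512, logit_scale_initial_value: 2.6592)

    {:ok, %{model: model, params: params}} =
      Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"}, spec: spec)
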
References