Bumblebee.Multimodal.Clip (Bumblebee v0.5.3)

The CLIP model for text-image similarity.

Architectures

  • :base - the base CLIP model
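
A minimal example of selecting this architecture explicitly when loading the model (the checkpoint name is an assumption, not part of this module):

    {:ok, clip} =
      Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"}, architecture: :base)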

Inputs

  • "input_ids" - {batch_size, sequence_length}

    Indices of input sequence tokens in the vocabulary.

  • "attention_mask" - {batch_size, sequence_length}

    Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.

  • "position_ids" - {batch_size, sequence_length}

    Indices of positions of each input sequence token in the position embeddings.

  • "pixel_values" - {batch_size, image_size, image_size, num_channels}

    Featurized image pixel values.
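
A minimal sketch of assembling these inputs end to end, assuming the openai/clip-vit-base-patch32 checkpoint and the stb_image package for reading the example image (the file name and prompts are placeholders):

    {:ok, clip} = Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})
    {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

    # "input_ids" and "attention_mask" come from the tokenizer
    text_inputs =
      Bumblebee.apply_tokenizer(tokenizer, ["a photo of a cat", "a photo of a dog"])

    # "pixel_values" comes from the featurizer
    image = StbImage.read_file!("photo.jpg")
    image_inputs = Bumblebee.apply_featurizer(featurizer, image)

    inputs = Map.merge(text_inputs, image_inputs)

    # outputs carries the text-image similarity logits and embeddings
    # (the exact output key names are assumptions here)
    outputs = Axon.predict(clip.model, clip.params, inputs)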

Configuration

  • :text_spec - the specification of the text model. See Bumblebee.Text.ClipText for details

  • :vision_spec - the specification of the vision model. See Bumblebee.Vision.ClipVision for details

  • :projection_size - the dimensionality of text and vision projection layers. Defaults to 512

  • :logit_scale_initial_value - the initial value for the scaling layer used to scale similarity logits. Defaults to 2.6592

  • :output_hidden_states - whether the model should return all hidden states. Defaults to false

  • :output_attentions - whether the model should return all attentions. Defaults to false
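
These options can be overridden by loading the spec, reconfiguring it, and passing it back when loading the model. A minimal sketch, with the checkpoint name again an assumption:

    {:ok, spec} = Bumblebee.load_spec({:hf, "openai/clip-vit-base-patch32"})

    # request hidden states and attentions in the model outputs
    spec = Bumblebee.configure(spec, output_hidden_states: true, output_attentions: true)

    {:ok, clip} = Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"}, spec: spec)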

References