Bumblebee.Multimodal.Clip (Bumblebee v0.6.0)

The CLIP model for text-image similarity.

Architectures

  • :base - the base CLIP model

Inputs

  • "input_ids" - {batch_size, sequence_length}

    Indices of input sequence tokens in the vocabulary.

  • "attention_mask" - {batch_size, sequence_length}

    Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.

  • "position_ids" - {batch_size, sequence_length}

    Indices of positions of each input sequence token in the position embeddings.

  • "pixel_values" - {batch_size, image_size, image_size, num_channels}

    Featurized image pixel values.

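For illustration, a minimal sketch of building these inputs and running the :base model. It assumes the openai/clip-vit-base-patch32 checkpoint, the stb_image package for image loading, and that the base architecture exposes similarity logits as logits_per_image / logits_per_text:

    {:ok, %{model: model, params: params}} =
      Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"})

    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})
    {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

    # "input_ids" and "attention_mask" come from the tokenizer
    text_inputs =
      Bumblebee.apply_tokenizer(tokenizer, ["a photo of a cat", "a photo of a dog"])

    # "pixel_values" comes from the featurizer
    image = StbImage.read_file!("cat.jpg")
    image_inputs = Bumblebee.apply_featurizer(featurizer, [image])

    inputs = Map.merge(text_inputs, image_inputs)

    outputs = Axon.predict(model, params, inputs)
    # outputs.logits_per_image holds image-to-text similarity scores
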
Global layer options

  • :output_hidden_states - when true, the model output includes all hidden states

  • :output_attentions - when true, the model output includes all attention weights

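A minimal sketch of enabling these options, assuming Axon's :global_layer_options compile option (available in recent Axon releases) and reusing the model, params, and inputs from the example above:

    {_init_fn, predict_fn} =
      Axon.build(model,
        global_layer_options: [output_hidden_states: true, output_attentions: true]
      )

    outputs = predict_fn.(params, inputs)
    # the output map now additionally includes per-layer hidden states and attention weights
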
Configuration

  • :text_spec - the specification of the text model. See Bumblebee.Text.ClipText for details

  • :vision_spec - the specification of the vision model. See Bumblebee.Vision.ClipVision for details

  • :projection_size - the dimensionality of text and vision projection layers. Defaults to 512

  • :logit_scale_initial_value - the initial value for the scaling layer used to scale similarity logits. Defaults to 2.6592

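For illustration, a spec can be fetched, reconfigured, and passed back when loading the model; the checkpoint name and option values below are only examples:

    {:ok, spec} = Bumblebee.load_spec({:hf, "openai/clip-vit-base-patch32"})

    # override configuration options (illustrative values)
    spec = Bumblebee.configure(spec, projection_size: 512, logit_scale_initial_value: 2.6592)

    {:ok, %{model: model, params: params}} =
      Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"}, spec: spec)
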
References