# Bumblebee.Multimodal.Clip (Bumblebee v0.6.0)
The CLIP model for text-image similarity.

## Architectures

* `:base` - the base CLIP model
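
For reference, loading a pretrained checkpoint with this architecture might look like the sketch below. The `openai/clip-vit-base-patch32` checkpoint name is an illustrative assumption, not something prescribed by this module.

```elixir
# Fetch the model spec and parameters from the Hugging Face Hub
# (the checkpoint name here is only an example)
{:ok, clip} = Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"})

# The returned map carries the Axon graph, its parameters and the spec
# whose configuration options are documented further down this page
%{model: model, params: params, spec: spec} = clip
```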

## Inputs

* `"input_ids"` - `{batch_size, sequence_length}`

  Indices of input sequence tokens in the vocabulary.
"attention_mask"
-{batch_size, sequence_length}
Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different length.
"position_ids"
-{batch_size, sequence_length}
Indices of positions of each input sequence tokens in the position embeddings.
"pixel_values"
-{batch_size, image_size, image_size, num_channels}
Featurized image pixel values.
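
The sketch below shows one way to assemble these inputs and run the model directly. The checkpoint name, the `cat.jpg` file, and the example texts are assumptions made for illustration.

```elixir
repo = {:hf, "openai/clip-vit-base-patch32"}

{:ok, clip} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)

# "input_ids" and "attention_mask" come from the tokenizer
text_inputs = Bumblebee.apply_tokenizer(tokenizer, ["a photo of a cat", "a photo of a dog"])

# "pixel_values" comes from the featurizer
{:ok, image} = StbImage.read_file("cat.jpg")
image_inputs = Bumblebee.apply_featurizer(featurizer, [image])

inputs = Map.merge(text_inputs, image_inputs)

# The outputs carry the image-text similarity logits
outputs = Axon.predict(clip.model, clip.params, inputs)
```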

## Global layer options

* `:output_hidden_states` - when `true`, the model output includes all hidden states

* `:output_attentions` - when `true`, the model output includes all attention weights
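
As a loose sketch of how these could be enabled, assuming they are forwarded as Axon global layer options at build time (this invocation is an assumption based on Axon's global layer options mechanism, not something this page specifies):

```elixir
# Assumption: the options are passed via Axon's :global_layer_options
# compile option when building the model
{_init_fn, predict_fn} =
  Axon.build(model,
    global_layer_options: [output_hidden_states: true, output_attentions: true]
  )

# With the options enabled, the output additionally carries
# all hidden states and attention weights
outputs = predict_fn.(params, inputs)
```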

## Configuration

* `:text_spec` - the specification of the text model. See `Bumblebee.Text.ClipText` for details

* `:vision_spec` - the specification of the vision model. See `Bumblebee.Vision.ClipVision` for details

* `:projection_size` - the dimensionality of text and vision projection layers. Defaults to `512`

* `:logit_scale_initial_value` - the initial value for the scaling layer used to scale similarity logits. Defaults to `2.6592`
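
These options can be overridden on a loaded spec before building the model, as in the sketch below. The checkpoint name is an assumption and the values shown are just the documented defaults.

```elixir
# Load only the spec, adjust its configuration, then build a fresh model
{:ok, spec} = Bumblebee.load_spec({:hf, "openai/clip-vit-base-patch32"})

spec =
  Bumblebee.configure(spec,
    projection_size: 512,
    logit_scale_initial_value: 2.6592
  )

model = Bumblebee.build_model(spec)
```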