Bumblebee.Multimodal.Clip (Bumblebee v0.5.3)
The CLIP model for text-image similarity.
Architectures
:base - the base CLIP model
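For example, a minimal sketch of loading the :base architecture from a Hugging Face checkpoint; the "openai/clip-vit-base-patch32" repository name is only an illustrative assumption:

    # Loads the model graph, parameters and spec for the :base architecture.
    # The repository name is an assumption for illustration.
    {:ok, %{model: model, params: params, spec: spec}} =
      Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
        architecture: :base
      )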
Inputs
"input_ids"-{batch_size, sequence_length}Indices of input sequence tokens in the vocabulary.
"attention_mask"-{batch_size, sequence_length}Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different length.
"position_ids"-{batch_size, sequence_length}Indices of positions of each input sequence tokens in the position embeddings.
"pixel_values"-{batch_size, image_size, image_size, num_channels}Featurized image pixel values.
Configuration
:text_spec - the specification of the text model. See Bumblebee.Text.ClipText for details
:vision_spec - the specification of the vision model. See Bumblebee.Vision.ClipVision for details
:projection_size - the dimensionality of text and vision projection layers. Defaults to 512
:logit_scale_initial_value - the initial value for the scaling layer used to scale similarity logits. Defaults to 2.6592
:output_hidden_states - whether the model should return all hidden states. Defaults to false
:output_attentions - whether the model should return all attentions. Defaults to false
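A minimal sketch of overriding these options by loading the spec, reconfiguring it, and passing it back when loading the model; the checkpoint name is an illustrative assumption:

    {:ok, spec} = Bumblebee.load_spec({:hf, "openai/clip-vit-base-patch32"})

    # Enable hidden states and attentions in the model outputs
    spec =
      Bumblebee.configure(spec,
        output_hidden_states: true,
        output_attentions: true
      )

    {:ok, clip} = Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"}, spec: spec)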