Bumblebee.Multimodal.Blip (Bumblebee v0.6.0)
The BLIP model for text-image similarity.
Architectures
:for_conditional_generation
- BLIP model with a language modeling head
Inputs
"pixel_values"
- {batch_size, image_size, image_size, num_channels}
Featurized image pixel values.
"decoder_input_ids"
- {batch_size, target_sequence_length}
Indices of decoder input sequence tokens in the vocabulary. If not present and "input_ids" is, it will be generated by shifting each token in "input_ids" to the right once.
"decoder_attention_mask"
- {batch_size, target_sequence_length}
Mask indicating which decoder tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.
"decoder_position_ids"
- {batch_size, target_sequence_length}
Indices of positions of each decoder input sequence token in the position embeddings.
"encoder_hidden_state"
- {batch_size, sequence_length, hidden_size}
Last hidden state output from the encoder. This hidden state is used in cross-attention blocks in the decoder. If specified, the model will skip the image encoding process and use this value directly for cross-attentions in the text decoder.
"cache"
A container with cached layer results used to speed up sequential decoding (autoregression). With cache, certain hidden states are taken from the cache, rather than recomputed on every decoding pass. The cache should be treated as opaque and initialized with Bumblebee.Text.Generation.init_cache/4.
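In practice these inputs are rarely built by hand. The following is a minimal sketch of image captioning with the :for_conditional_generation architecture, assuming the "Salesforce/blip-image-captioning-base" checkpoint on Hugging Face, the EXLA compiler, and StbImage for reading the image; the dependency requirements and the image path are placeholders, not recommendations. The featurizer produces "pixel_values", and the serving is expected to manage "decoder_input_ids" and "cache" internally during generation.

```elixir
# Assumption: run as a script or in Livebook; the version requirements are illustrative.
Mix.install([
  {:bumblebee, "~> 0.6.0"},
  {:exla, ">= 0.0.0"},
  {:stb_image, ">= 0.0.0"}
])

# Assumption: this checkpoint is used only as an example of a BLIP captioning model.
repo = {:hf, "Salesforce/blip-image-captioning-base"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# Build a serving that featurizes the image and runs autoregressive decoding.
serving =
  Bumblebee.Vision.image_to_text(model_info, featurizer, tokenizer, generation_config,
    defn_options: [compiler: EXLA]
  )

# Placeholder path; replace with a real image file.
image = StbImage.read_file!("path/to/image.jpg")

# Returns a map with the generated caption(s).
serving
|> Nx.Serving.run(image)
|> IO.inspect()
```

To reuse an already computed image encoding, "encoder_hidden_state" can be passed to the model directly instead of "pixel_values", as described above.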
Global layer options
:output_hidden_states
- when true, the model output includes all hidden states
:output_attentions
- when true, the model output includes all attention weights
Configuration
:text_spec
- the specification of the text model. See Bumblebee.Text.BlipText for details
:vision_spec
- the specification of the vision model. See Bumblebee.Vision.BlipVision for details
:projection_size
- the dimensionality of text and vision projection layers. Defaults to 512
:logit_scale_initial_value
- the initial value for the scaling layer used to scale similarity logits. Defaults to 2.6592
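As a rough illustration of how these configuration attributes surface in code, the sketch below loads only the model specification and inspects it. It assumes Bumblebee is already installed and reuses the "Salesforce/blip-image-captioning-base" checkpoint as an example; the values printed depend on that checkpoint's configuration, so none are shown here.

```elixir
# Assumption: the checkpoint name is illustrative; any BLIP checkpoint should work similarly.
repo = {:hf, "Salesforce/blip-image-captioning-base"}

# Loads the specification (configuration) without downloading model parameters.
{:ok, spec} = Bumblebee.load_spec(repo)

# Top-level attributes documented in the Configuration list.
IO.inspect(spec.architecture, label: "architecture")
IO.inspect(spec.projection_size, label: "projection_size")
IO.inspect(spec.logit_scale_initial_value, label: "logit_scale_initial_value")

# The nested specifications are structs of the text and vision modules.
IO.inspect(spec.text_spec.__struct__, label: "text_spec module")
IO.inspect(spec.vision_spec.__struct__, label: "vision_spec module")
```

A modified specification can be obtained with Bumblebee.configure/2 and passed to Bumblebee.load_model/2 through the :spec option, though changing size-related attributes of a pretrained checkpoint will generally not match its stored parameters.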