Bumblebee.Multimodal.Blip (Bumblebee v0.6.0)

The BLIP model for text-image similarity.

Architectures

:for_conditional_generation - BLIP model with a language modeling head

Inputs

"pixel_values" - {batch_size, image_size, image_size, num_channels}
Featurized image pixel values.
"decoder_input_ids" - {batch_size, target_sequence_length}
Indices of decoder input sequence tokens in the vocabulary. If absent while "input_ids" is present, it is generated by shifting each token in "input_ids" one position to the right.
"decoder_attention_mask" - {batch_size, target_sequence_length}
Mask indicating which decoder tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.
"decoder_position_ids" - {batch_size, target_sequence_length}
Position index of each decoder input sequence token in the position embeddings.
"encoder_hidden_state" - {batch_size, sequence_length, hidden_size}
Last hidden state output from the encoder. This hidden state is used in cross-attention blocks in the decoder. If specified, the model will skip the image encoding process and use this value directly for cross-attentions in the text decoder.
"cache"
A container with cached layer results used to speed up sequential decoding (autoregression). With a cache, certain hidden states are taken from the cache rather than recomputed on every decoding pass. The cache should be treated as opaque and initialized with Bumblebee.Text.Generation.init_cache/4.
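
The relationship between the decoder inputs above can be illustrated with a minimal sketch in plain Elixir. It uses nested lists instead of tensors, and the module name DecoderInputsSketch, its functions, and the token ids (start token 1, padding token 0) are hypothetical values chosen for illustration, not part of Bumblebee.

```elixir
defmodule DecoderInputsSketch do
  # Hypothetical helper: shifts every sequence one position to the right by
  # prepending `start_id` and dropping the last token.
  def shift_right(batch, start_id) do
    Enum.map(batch, fn ids -> [start_id | Enum.drop(ids, -1)] end)
  end

  # Hypothetical helper: 1 marks a token to attend to, 0 marks padding.
  def attention_mask(batch, pad_id) do
    Enum.map(batch, fn ids ->
      Enum.map(ids, fn id -> if id == pad_id, do: 0, else: 1 end)
    end)
  end

  # Hypothetical helper: position indices 0, 1, 2, ... for every sequence.
  def position_ids(batch) do
    Enum.map(batch, fn ids ->
      ids |> Enum.with_index() |> Enum.map(fn {_id, index} -> index end)
    end)
  end
end

# Two sequences padded to the same length with the assumed padding id 0.
batch = [[5, 6, 7, 0], [8, 9, 0, 0]]

decoder_input_ids = DecoderInputsSketch.shift_right(batch, 1)
# decoder_input_ids is [[1, 5, 6, 7], [1, 8, 9, 0]]

decoder_attention_mask = DecoderInputsSketch.attention_mask(decoder_input_ids, 0)
# decoder_attention_mask is [[1, 1, 1, 1], [1, 1, 1, 0]]

decoder_position_ids = DecoderInputsSketch.position_ids(decoder_input_ids)
# decoder_position_ids is [[0, 1, 2, 3], [0, 1, 2, 3]]
```

In practice these inputs are tensors of shape {batch_size, target_sequence_length}, and the start and padding token ids come from the loaded model and tokenizer rather than being hard-coded.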

Global layer options

:output_hidden_states - when true, the model output includes all hidden states
:output_attentions - when true, the model output includes all attention weights
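
As a rough sketch of how these two options might be enabled, the snippet below passes them through the :global_layer_options option of Axon.build/2. It assumes Bumblebee and Axon are installed and that the "Salesforce/blip-image-captioning-base" checkpoint can be downloaded; the checkpoint choice is an assumption for illustration only.

```elixir
# Sketch only: requires the :bumblebee dependency and network access.
{:ok, %{model: model, params: params}} =
  Bumblebee.load_model({:hf, "Salesforce/blip-image-captioning-base"})

# Global layer options are forwarded to every layer that accepts them.
{_init_fn, predict_fn} =
  Axon.build(model,
    global_layer_options: [output_hidden_states: true, output_attentions: true]
  )

# `predict_fn.(params, inputs)` can then be called with a map of the inputs
# described in this module's documentation.
```

Both options default to off, so the extra hidden states and attention weights are only computed and returned when explicitly requested.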

Configuration

:text_spec - the specification of the text model. See Bumblebee.Text.BlipText for details
:vision_spec - the specification of the vision model. See Bumblebee.Vision.BlipVision for details
:projection_size - the dimensionality of text and vision projection layers. Defaults to 512
:logit_scale_initial_value - the initial value for the scaling layer used to scale similarity logits. Defaults to 2.6592
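
A minimal sketch of overriding these options with Bumblebee.configure/2 follows. It assumes Bumblebee v0.6.0 is available as a dependency; the value 256 is an arbitrary example, not a recommended setting.

```elixir
# Sketch only: builds a specification from the module defaults and overrides
# two of the options listed above.
spec =
  Bumblebee.configure(Bumblebee.Multimodal.Blip,
    projection_size: 256,
    logit_scale_initial_value: 2.6592
  )

# The configured values are available as struct fields.
IO.inspect(spec.projection_size)
```

A specification loaded from a checkpoint with Bumblebee.load_spec/2 can be passed to Bumblebee.configure/2 in the same way. Note that changing :projection_size alters layer shapes, so pretrained parameters for the affected layers would no longer match.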