Bumblebee.Text.BlipText (Bumblebee v0.5.3)

The BLIP model for text encoding.

Architectures

  • :base - the base text model

Inputs

  • "input_ids" - {batch_size, sequence_length}

    Indices of input sequence tokens in the vocabulary.

  • "attention_mask" - {batch_size, sequence_length}

    Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.

  • "position_ids" - {batch_size, sequence_length}

    Indices of the position of each input sequence token in the position embeddings.

  • "attention_head_mask" - {encoder_num_blocks, encoder_num_attention_heads}

    Mask to nullify selected heads of the self-attention blocks in the encoder.

  • "input_embeddings" - {batch_size, sequence_length, hidden_size}

    Embedded representation of "input_ids", which can be specified for more control over how "input_ids" are embedded than the model's internal embedding lookup. If "input_embeddings" are present, then "input_ids" will be ignored.

  • "encoder_hidden_state" - {batch_size, encoder_sequence_length, encoder_hidden_size}

    Last hidden state output from the encoder. This hidden state is used in cross-attention blocks in the decoder. If specified, the model will skip the encoding process and use this value directly for cross-attentions in the decoder.

  • "encoder_attention_mask" - {batch_size, encoder_sequence_length}

    Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.

  • "cross_attention_head_mask" - {num_blocks, num_attention_heads}

    Mask to nullify selected heads of the cross-attention blocks in the decoder.

  • "cache"

    A container with cached layer results used to speed up sequential decoding (autoregression). With cache, certain hidden states are taken from the cache, rather than recomputed on every decoding pass. The cache should be treated as opaque and initialized with Bumblebee.Text.Generation.init_cache/4.
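
The relationship between "input_ids", "attention_mask" and "position_ids" can be illustrated with a minimal sketch in plain Elixir. The PaddingSketch module, the padding id 0 and the token ids below are made up for illustration and are not part of Bumblebee; in practice these inputs normally come from a tokenizer, and nested lists like these would still need to be converted to tensors (for example with Nx.tensor/1) before being passed to the model.

```elixir
defmodule PaddingSketch do
  # Hypothetical helper, not part of Bumblebee. Pads each list of token ids
  # to the length of the longest one and builds the matching mask and
  # position indices. Assumes a non-empty batch and a padding id of 0.
  def build_inputs(sequences, pad_id \\ 0) do
    max_len = sequences |> Enum.map(&length/1) |> Enum.max()

    rows =
      Enum.map(sequences, fn ids ->
        pad = max_len - length(ids)

        %{
          # Token ids followed by padding up to the batch length
          ids: ids ++ List.duplicate(pad_id, pad),
          # 1 for real tokens, 0 for padding that should be ignored
          mask: List.duplicate(1, length(ids)) ++ List.duplicate(0, pad),
          # Consecutive positions 0, 1, ..., max_len - 1
          positions: Enum.to_list(0..(max_len - 1)//1)
        }
      end)

    %{
      "input_ids" => Enum.map(rows, & &1.ids),
      "attention_mask" => Enum.map(rows, & &1.mask),
      "position_ids" => Enum.map(rows, & &1.positions)
    }
  end
end

# Two sequences of different lengths; the token ids are arbitrary.
inputs = PaddingSketch.build_inputs([[101, 2023, 102], [101, 102]])
```

Every value in the resulting map has the shape {batch_size, sequence_length}, here {2, 3}, and the second row of "attention_mask" ends with 0 to mark the padded position.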

Configuration

  • :vocab_size - the vocabulary size of the token embedding. This corresponds to the number of distinct tokens that can be represented in model input and output. Defaults to 30524

  • :max_positions - the vocabulary size of the position embedding. This corresponds to the maximum sequence length that this model can process. Typically this is set to a large value just in case, such as 512, 1024 or 2048. Defaults to 512

  • :hidden_size - the dimensionality of hidden layers. Defaults to 768

  • :encoder_hidden_size - the dimensionality of hidden layers in the vision encoder. Defaults to 768

  • :num_blocks - the number of Transformer blocks in the encoder. Defaults to 12

  • :num_attention_heads - the number of attention heads for each attention layer in the encoder. Defaults to 8

  • :intermediate_size - the dimensionality of the intermediate layer in the transformer feed-forward network (FFN) in the encoder. Defaults to 3072

  • :activation - the activation function. Defaults to :gelu

  • :dropout_rate - the dropout rate for embedding and encoder. Defaults to 0.0

  • :attention_dropout_rate - the dropout rate for attention weights. Defaults to 0.0

  • :layer_norm_epsilon - the epsilon used by the layer normalization layers. Defaults to 1.0e-12

  • :initializer_scale - the standard deviation of the normal initializer used for initializing kernel parameters. Defaults to 0.02

  • :output_hidden_states - whether the model should return all hidden states. Defaults to false

  • :output_attentions - whether the model should return all attentions. Defaults to false
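
The options above could be overridden roughly as follows. This is a sketch, assuming Bumblebee is available as a dependency and that Bumblebee.configure/2 accepts these option names for this module; the override values (6 blocks, hidden states enabled) are illustrative only and do not correspond to any published checkpoint.

```elixir
# Assumes Bumblebee is already installed, e.g. via Mix.install/2 or mix.exs.
# Options that are not listed keep the defaults documented above.
spec =
  Bumblebee.configure(Bumblebee.Text.BlipText,
    architecture: :base,
    # Illustrative value: half of the default 12 blocks
    num_blocks: 6,
    # Ask the model to return all hidden states
    output_hidden_states: true
  )

# Builds an Axon model from the specification. The parameters are not
# loaded here, so they would need to be initialized or loaded separately.
model = Bumblebee.build_model(spec)
```

A specification built this way will not match the parameter shapes of a pretrained checkpoint if structural options such as :num_blocks or :hidden_size differ from those the checkpoint was trained with.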