Bumblebee.Text.T5 (Bumblebee v0.6.0)

T5 model family.

Architectures

:base - plain T5 without any head on top
:for_conditional_generation - T5 with a language modeling head. The head returns logits for each token in the original sequence
:encoder - just the encoder part of the base model
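
As a minimal sketch of how an architecture is selected, the snippet below loads a checkpoint with the :architecture option of Bumblebee.load_model/2. The repository name "google-t5/t5-small" and the version pin in Mix.install are assumptions for illustration; any T5 checkpoint should behave similarly.

```elixir
Mix.install([{:bumblebee, "~> 0.6.0"}])

# Assumed checkpoint, used for illustration only.
repo = {:hf, "google-t5/t5-small"}

# Request just the encoder stack rather than the architecture
# recorded in the checkpoint configuration.
{:ok, %{model: _model, params: _params, spec: spec}} =
  Bumblebee.load_model(repo, architecture: :encoder)

# The returned spec records which architecture was built.
IO.inspect(spec.architecture)
```

Passing :base or :for_conditional_generation instead selects the other architectures listed above.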

Inputs

"input_ids" - {batch_size, sequence_length}
Indices of input sequence tokens in the vocabulary.
"attention_mask" - {batch_size, sequence_length}
Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences of different lengths.
"attention_head_mask" - {encoder_num_blocks, encoder_num_attention_heads}
Mask to nullify selected heads of the self-attention blocks in the encoder.
"input_embeddings" - {batch_size, sequence_length, hidden_size}
Embedded representation of "input_ids", which can be specified for more control over how "input_ids" are embedded than the model's internal embedding lookup. If "input_embeddings" are present, then "input_ids" will be ignored.
"decoder_input_ids" - {batch_size, target_sequence_length}
Indices of decoder input sequence tokens in the vocabulary. If absent while "input_ids" is present, it is generated by shifting the tokens in "input_ids" one position to the right.
"decoder_attention_mask" - {batch_size, target_sequence_length}
Mask indicating which decoder tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences of different lengths.
"decoder_attention_head_mask" - {decoder_num_blocks, decoder_num_attention_heads}
Mask to nullify selected heads of the self-attention blocks in the decoder.
"decoder_input_embeddings" - {batch_size, sequence_length, hidden_size}
Embedded representation of "decoder_input_ids", which can be specified for more control over how "decoder_input_ids" are embedded than the model's internal embedding lookup. If "decoder_input_embeddings" are present, then "decoder_input_ids" will be ignored.
"encoder_hidden_state" - {batch_size, sequence_length, hidden_size}
Last hidden state output from the encoder. This hidden state is used in cross-attention blocks in the decoder. If specified, the model will skip the encoding process and use this value directly for cross-attentions in the decoder.
"cross_attention_head_mask" - {decoder_num_blocks, decoder_num_attention_heads}
Mask to nullify selected heads of the cross-attention blocks in the decoder.
"cache"
A container with cached layer results used to speed up sequential decoding (autoregression). With cache, certain hidden states are taken from the cache, rather than recomputed on every decoding pass. The cache should be treated as opaque and initialized with Bumblebee.Text.Generation.init_cache/4.
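
The inputs above are passed to the model as a map keyed by input name. The following is a rough sketch, assuming the "google-t5/t5-small" checkpoint and an arbitrary example prompt; only "input_ids" and "attention_mask" are supplied, so the decoder inputs fall back to the default described above.

```elixir
Mix.install([{:bumblebee, "~> 0.6.0"}])

# Assumed checkpoint, used for illustration only.
repo = {:hf, "google-t5/t5-small"}

{:ok, %{model: model, params: params}} =
  Bumblebee.load_model(repo, architecture: :for_conditional_generation)

{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

# Tokenize a single-element batch; the result is a map of tensors.
tokenized =
  Bumblebee.apply_tokenizer(tokenizer, ["translate English to German: Hello."])

# Keep only the two inputs this sketch relies on.
inputs = Map.take(tokenized, ["input_ids", "attention_mask"])

outputs = Axon.predict(model, params, inputs)

# Logits have one score per vocabulary entry for each decoder position.
IO.inspect(Nx.shape(outputs.logits))
```

For actual text generation, the higher-level Bumblebee.Text.generation/4 serving manages the decoder inputs and the cache, so this map is mostly useful for inspecting raw model outputs.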

Global layer options

:output_hidden_states - when true, the model output includes all hidden states
:output_attentions - when true, the model output includes all attention weights
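
A sketch of enabling one of these options, assuming they are forwarded through the :global_layer_options option of Axon.build/2. The checkpoint name and the token ids below are placeholders chosen for illustration.

```elixir
Mix.install([{:bumblebee, "~> 0.6.0"}])

# Assumed checkpoint, used for illustration only.
{:ok, %{model: model, params: params}} =
  Bumblebee.load_model({:hf, "google-t5/t5-small"}, architecture: :encoder)

# Build the prediction function with the global option switched on.
{_init_fun, predict_fun} =
  Axon.build(model, global_layer_options: [output_hidden_states: true])

# Placeholder token ids for a batch with one sequence of three tokens.
inputs = %{
  "input_ids" => Nx.tensor([[100, 200, 1]]),
  "attention_mask" => Nx.tensor([[1, 1, 1]])
}

outputs = predict_fun.(params, inputs)

# With the option enabled, the hidden states are part of the output.
IO.inspect(outputs.hidden_states)
```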

Configuration

:vocab_size - the vocabulary size of the token embedding. This corresponds to the number of distinct tokens that can be represented in model input and output. Defaults to 32128
:hidden_size - the dimensionality of hidden layers. Defaults to 512
:attention_head_size - the size of the key, value, and query projection per attention head. Defaults to 64
:encoder_num_blocks - the number of Transformer blocks in the encoder. Defaults to 6
:decoder_num_blocks - the number of Transformer blocks in the decoder. Defaults to 6
:encoder_num_attention_heads - the number of attention heads for each attention layer in the encoder. Defaults to 8
:decoder_num_attention_heads - the number of attention heads for each attention layer in the decoder. Defaults to 8
:activation - the activation function. Defaults to :relu
:ffn_gated_activation - whether to use a gated variant of the activation function in the feed-forward network (FFN). Defaults to false
:dropout_rate - the dropout rate for encoder and decoder. Defaults to 0.1
:initializer_scale - the standard deviation of the normal initializer used for initializing kernel parameters. Defaults to 1.0
:layer_norm_epsilon - the epsilon used by the layer normalization layers. Defaults to 1.0e-6
:tie_word_embeddings - whether to tie the encoder and decoder token embeddings. Defaults to true
:num_labels - the number of labels to use in the last layer for the classification task. Defaults to 2
:id_to_label - a map from class index to label. Defaults to %{}
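
The configuration options can be overridden when building a model from scratch. The snippet below is a minimal sketch using Bumblebee.configure/2 and Bumblebee.build_model/1; the specific sizes are arbitrary values picked for illustration, not recommended settings.

```elixir
Mix.install([{:bumblebee, "~> 0.6.0"}])

# Start from the defaults listed above and override a few of them.
spec =
  Bumblebee.configure(Bumblebee.Text.T5,
    architecture: :encoder,
    hidden_size: 256,
    encoder_num_blocks: 2,
    encoder_num_attention_heads: 4
  )

# Build an Axon model (without pretrained parameters) from the spec.
model = Bumblebee.build_model(spec)

# Options that were not overridden keep their defaults.
IO.inspect(spec.vocab_size)
```

To adjust the configuration of a pretrained checkpoint instead, the spec can be loaded with Bumblebee.load_spec/2, updated with Bumblebee.configure/2, and passed to Bumblebee.load_model/2 through the :spec option.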