Bumblebee.Text.Roberta (Bumblebee v0.4.2)

RoBERTa model family.

Architectures

:base - plain RoBERTa without any head on top
:for_masked_language_modeling - RoBERTa with a language modeling head. The head returns logits for each token in the original sequence
:for_sequence_classification - RoBERTa with a sequence classification head. The head returns logits corresponding to possible classes
:for_token_classification - RoBERTa with a token classification head. The head returns logits for each token in the original sequence
:for_question_answering - RoBERTa with a span classification head. The head returns logits for the span start and end positions
:for_multiple_choice - RoBERTa with a multiple choice prediction head. Each input in the batch consists of several sequences to choose from and the model returns logits corresponding to those choices
:for_causal_language_modeling - RoBERTa working as a decoder with a language modeling head. The head returns logits for each token in the original sequence

Inputs

"input_ids" - {batch_size, sequence_length}
Indices of input sequence tokens in the vocabulary.
"attention_mask" - {batch_size, sequence_length}
Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences of different lengths.
"token_type_ids" - {batch_size, sequence_length}
Mask distinguishing groups in the input sequence. This is used when the input sequence is semantically a pair of sequences.
"position_ids" - {batch_size, sequence_length}
Indices of the position of each input sequence token in the position embeddings.
"attention_head_mask" - {num_blocks, num_attention_heads}
Mask to nullify selected heads of the self-attention blocks in the encoder.
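
As a rough illustration of how "input_ids" and "attention_mask" relate, the sketch below pads two sequences to a common length using plain Elixir lists. The token ids and the padding id `pad_id` are made-up values for illustration, not real tokenizer output. In practice these inputs would normally come from a tokenizer (for example via `Bumblebee.apply_tokenizer/2`) as tensors rather than being built by hand.

```elixir
# Hypothetical token id sequences of different lengths (not real tokenizer output).
sequences = [[0, 100, 200, 2], [0, 300, 2]]

# Assumed padding id for this sketch; the real value depends on the tokenizer.
pad_id = 1

max_length = sequences |> Enum.map(&length/1) |> Enum.max()

# "input_ids": every sequence is right-padded to the longest one in the batch.
input_ids =
  Enum.map(sequences, fn ids ->
    ids ++ List.duplicate(pad_id, max_length - length(ids))
  end)

# "attention_mask": 1 marks a real token, 0 marks a padding token to ignore.
attention_mask =
  Enum.map(sequences, fn ids ->
    List.duplicate(1, length(ids)) ++ List.duplicate(0, max_length - length(ids))
  end)

IO.inspect(input_ids, label: "input_ids")
IO.inspect(attention_mask, label: "attention_mask")
```

Both results have the shape {batch_size, sequence_length}, here {2, 4}.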

Exceptions

The :for_multiple_choice model accepts groups of sequences, so the expected sequence shape is {batch_size, num_choices, sequence_length}.
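
To make the extra dimension concrete, here is a minimal sketch that represents a multiple choice batch as nested lists and derives its shape. The token ids are hypothetical and already padded. A real input would be a tensor with the same layout.

```elixir
# Hypothetical, already padded token ids:
# 2 examples in the batch, 3 candidate sequences each, 4 tokens per sequence.
input_ids = [
  [[0, 11, 12, 2], [0, 13, 14, 2], [0, 15, 2, 1]],
  [[0, 21, 22, 2], [0, 23, 2, 1], [0, 24, 25, 2]]
]

batch_size = length(input_ids)
num_choices = input_ids |> hd() |> length()
sequence_length = input_ids |> hd() |> hd() |> length()

# Matches the expected {batch_size, num_choices, sequence_length} layout.
shape = {batch_size, num_choices, sequence_length}
IO.inspect(shape, label: "shape")
```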

The :for_causal_language_modeling model is a decoder and accepts the following additional inputs: "encoder_hidden_state", "encoder_attention_mask", "cross_attention_head_mask", "cache".

Configuration

:vocab_size - the vocabulary size of the token embedding. This corresponds to the number of distinct tokens that can be represented in model input and output. Defaults to 30522
:max_positions - the vocabulary size of the position embedding. This corresponds to the maximum sequence length that this model can process. Typically this is set to a large value just in case, such as 512, 1024 or 2048. Defaults to 512
:max_token_types - the vocabulary size of the token type embedding (also referred to as segment embedding). This corresponds to how many different token groups can be distinguished in the input. Defaults to 2
:hidden_size - the dimensionality of hidden layers. Defaults to 768
:num_blocks - the number of Transformer blocks in the encoder. Defaults to 12
:num_attention_heads - the number of attention heads for each attention layer in the encoder. Defaults to 12
:intermediate_size - the dimensionality of the intermediate layer in the transformer feed-forward network (FFN) in the encoder. Defaults to 3072
:activation - the activation function. Defaults to :gelu
:dropout_rate - the dropout rate for embedding and encoder. Defaults to 0.1
:attention_dropout_rate - the dropout rate for attention weights. Defaults to 0.1
:classifier_dropout_rate - the dropout rate for the classification head. If not specified, the value of :dropout_rate is used instead
:layer_norm_epsilon - the epsilon used by the layer normalization layers. Defaults to 1.0e-12
:initializer_range - the standard deviation of the normal initializer used for initializing kernel parameters. Defaults to 0.02
:output_hidden_states - whether the model should return all hidden states. Defaults to false
:output_attentions - whether the model should return all attentions. Defaults to false
:num_labels - the number of labels to use in the last layer for the classification task. Defaults to 2
:id_to_label - a map from class index to label. Defaults to %{}
:use_cross_attention - whether cross-attention layers should be added to the model. This is only relevant for decoder models. Defaults to false
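
The options above can be overridden when building a model specification. The following is a minimal sketch, assuming the script can fetch Bumblebee from Hex via `Mix.install/1`. It uses `Bumblebee.configure/2` to start from the module defaults and select an architecture along with a few options. The label names are hypothetical and chosen only for illustration.

```elixir
# Assumes network access to fetch the dependency from Hex.
Mix.install([{:bumblebee, "~> 0.4.2"}])

# Start from the module defaults and override selected options.
spec =
  Bumblebee.configure(Bumblebee.Text.Roberta,
    architecture: :for_sequence_classification,
    num_labels: 3,
    # Hypothetical label names; the map size matches :num_labels.
    id_to_label: %{0 => "negative", 1 => "neutral", 2 => "positive"},
    classifier_dropout_rate: 0.2
  )

IO.inspect(spec.architecture, label: "architecture")
IO.inspect(spec.num_labels, label: "num_labels")

# Options that were not overridden keep the defaults listed above.
IO.inspect(spec.hidden_size, label: "hidden_size")
```

A specification configured this way describes a model with freshly initialized parameters. To reuse pretrained weights, the same options would typically be applied to a specification loaded from a checkpoint instead.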
