Bumblebee.Text.GptNeoX (Bumblebee v0.5.3)
GPT-NeoX model family.
Architectures
:base
- plain GPT-NeoX without any head on top
:for_causal_language_modeling
- GPT-NeoX with a language modeling head. The head returns logits for each token in the original sequence
:for_sequence_classification
- GPT-NeoX with a sequence classification head. The head returns logits corresponding to possible classes
:for_token_classification
- GPT-NeoX with a token classification head. The head returns logits for each token in the original sequence
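For example, a specific architecture can be selected when loading a checkpoint. This is a minimal sketch; "EleutherAI/pythia-70m" is just one publicly available GPT-NeoX-family checkpoint used here for illustration:

    {:ok, model_info} =
      Bumblebee.load_model({:hf, "EleutherAI/pythia-70m"},
        architecture: :for_causal_language_modeling
      )

    model_info.spec.architecture
    #=> :for_causal_language_modeling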
Inputs
"input_ids"
- {batch_size, sequence_length}
Indices of input sequence tokens in the vocabulary.
"attention_mask"
- {batch_size, sequence_length}
Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.
"position_ids"
- {batch_size, sequence_length}
Indices of positions of each input sequence token in the position embeddings.
"attention_head_mask"
- {encoder_num_blocks, encoder_num_attention_heads}
Mask to nullify selected heads of the self-attention blocks in the encoder.
"input_embeddings"
-{batch_size, sequence_length, hidden_size}
Embedded representation of
"input_ids"
, which can be specified for more control over how"input_ids"
are embedded than the model's internal embedding lookup. If"input_embeddings"
are present, then"input_ids"
will be ignored."cache"
A container with cached layer results used to speed up sequential decoding (autoregression). With cache, certain hidden states are taken from the cache, rather than recomputed on every decoding pass. The cache should be treated as opaque and initialized with
Bumblebee.Text.Generation.init_cache/4
.
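As a sketch, assuming the "EleutherAI/pythia-70m" checkpoint and its matching tokenizer, the inputs above can be produced with a tokenizer and passed to the model directly:

    {:ok, model_info} = Bumblebee.load_model({:hf, "EleutherAI/pythia-70m"})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "EleutherAI/pythia-70m"})

    # Returns a map with "input_ids" and "attention_mask" tensors
    inputs = Bumblebee.apply_tokenizer(tokenizer, "Hello world")

    outputs = Axon.predict(model_info.model, model_info.params, inputs)
    # For :for_causal_language_modeling the output includes per-token logits
    outputs.logits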
Configuration
:vocab_size
- the vocabulary size of the token embedding. This corresponds to the number of distinct tokens that can be represented in model input and output. Defaults to 32000
:hidden_size
- the dimensionality of hidden layers. Defaults to 4096
:intermediate_size
- the dimensionality of intermediate layers. Defaults to 11008
:num_blocks
- the number of Transformer blocks in the model. Defaults to 32
:num_attention_heads
- the number of attention heads for each attention layer in the model. Defaults to 32
:activation
- the activation function. Defaults to :silu
:rotary_embedding_percentage
- percentage of hidden dimensions to allocate to rotary embeddings. Defaults to 0.25
:rotary_embedding_base
- base for computing rotary embedding frequency. Defaults to 10000
:classifier_dropout_rate
- the dropout rate for the classification head. Defaults to 0.1
:layer_norm_epsilon
- the epsilon used by RMS normalization layers. Defaults to 1.0e-12
:initializer_scale
- the standard deviation of the normal initializer used for initializing kernel parameters. Defaults to 0.02
:use_parallel_transformer_block
- whether to use the parallel formulation of the Transformer block, where attention and FFN are computed independently. Defaults to true
:output_hidden_states
- whether the model should return all hidden states. Defaults to false
:output_attentions
- whether the model should return all attentions. Defaults to false
:num_labels
- the number of labels to use in the last layer for the classification task. Defaults to 2
:id_to_label
- a map from class index to label. Defaults to %{}