Edifice.Attention.Conformer (Edifice v0.2.0)


Conformer: convolution-augmented transformer for audio/speech processing.

The Conformer combines self-attention with convolution to capture both global and local patterns. It uses a Macaron-style architecture with two half-step feed-forward modules sandwiching the attention and convolution modules.

Architecture (Macaron Block)

Input [batch, seq_len, hidden_size]
      |
+------------------------------------------------+
|   Conformer Block (x num_layers)               |
|                                                |
|   1. Half-FFN: norm -> FFN -> scale(0.5)       |
|      -> residual                               |
|   2. MHSA: norm -> self_attention -> residual  |
|   3. Conv module:                              |
|      norm -> pointwise_up -> GLU               |
|      -> depthwise_conv -> norm -> act          |
|      -> pointwise_down -> residual             |
|   4. Half-FFN: norm -> FFN -> scale(0.5)       |
|      -> residual                               |
|   5. Final LayerNorm                           |
+------------------------------------------------+
      |
Final LayerNorm
      |
Last timestep -> [batch, hidden_size]
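
The half-step residual (steps 1 and 4) and the GLU gate (step 3) in the diagram can be sketched on plain lists. This is only an illustration under assumptions: `MacaronSketch`, `half_ffn_residual/2`, and `glu/1` are hypothetical names that are not part of Edifice, and the GLU convention shown (first half of the channels gated by the sigmoid of the second half) is the common definition, not necessarily the exact one used internally.

```elixir
defmodule MacaronSketch do
  # Hypothetical helpers for illustration only; not part of Edifice.

  # Half-step feed-forward residual: y = x + 0.5 * ffn(x).
  # `ffn` is any function from a list of numbers to a list of the same length.
  def half_ffn_residual(x, ffn) when is_list(x) and is_function(ffn, 1) do
    x
    |> ffn.()
    |> Enum.zip(x)
    |> Enum.map(fn {f, xi} -> xi + 0.5 * f end)
  end

  # Gated linear unit: split the channels in half and gate the first half
  # by the sigmoid of the second half (assumed convention).
  def glu(channels) when is_list(channels) and rem(length(channels), 2) == 0 do
    {a, b} = Enum.split(channels, div(length(channels), 2))
    Enum.zip_with(a, b, fn ai, bi -> ai * sigmoid(bi) end)
  end

  defp sigmoid(z), do: 1.0 / (1.0 + :math.exp(-z))
end
```

Because the GLU halves the channel count, the pointwise_up projection in a Conformer convolution module typically doubles the channels first so that the width after gating matches hidden_size.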

Usage

alias Edifice.Attention.Conformer

model = Conformer.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  conv_kernel_size: 31,
  num_layers: 4
)

References

  • "Conformer: Convolution-augmented Transformer for Speech Recognition" (Gulati et al., 2020)

Summary

Types

build_opt()

Options for build/1.

Functions

build(opts \\ [])

Build a Conformer model.

build_conformer_block(input, opts)

Build a single Conformer block with the Macaron structure.

output_size(opts \\ [])

Get the output size of a Conformer model.

Types

build_opt()

@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:conv_kernel_size, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:dropout, float()}
  | {:window_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a Conformer model.

Options

  • :embed_dim - Size of input embedding per timestep (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_heads - Number of attention heads (default: 4)
  • :conv_kernel_size - Kernel size for depthwise convolution (default: 31)
  • :num_layers - Number of Conformer blocks (default: 4)
  • :dropout - Dropout rate (default: 0.1)
  • :window_size - Expected sequence length for JIT optimization (default: 60)
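
How the documented defaults combine with caller-supplied options can be sketched with the standard `Keyword` module. `ConformerOptsSketch` and `resolve/1` are hypothetical names used for illustration. The default values are copied from the list above, but the merging logic is an assumption and may differ from what build/1 does internally.

```elixir
defmodule ConformerOptsSketch do
  # Hypothetical helper; not part of Edifice.
  # Defaults are copied from the documented option list.
  @defaults [
    hidden_size: 256,
    num_heads: 4,
    conv_kernel_size: 31,
    num_layers: 4,
    dropout: 0.1,
    window_size: 60
  ]

  def resolve(opts) when is_list(opts) do
    # :embed_dim is documented as required, so raise KeyError if it is missing.
    _embed_dim = Keyword.fetch!(opts, :embed_dim)

    # Caller-supplied values override the defaults.
    Keyword.merge(@defaults, opts)
  end
end
```

For example, `ConformerOptsSketch.resolve(embed_dim: 287, num_layers: 2)` keeps hidden_size at 256 while overriding num_layers.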

Returns

An Axon model that outputs [batch, hidden_size] from the last position.
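
The reduction from [batch, seq_len, hidden_size] to [batch, hidden_size] amounts to keeping only the final timestep of each sequence. A minimal sketch on nested lists, where `LastTimestepSketch` is a hypothetical module that stands in for the tensor slicing a real model would perform:

```elixir
defmodule LastTimestepSketch do
  # Hypothetical helper; not part of Edifice.
  # `batch` is a nested list shaped [batch][seq_len][hidden_size].
  def last_position(batch) when is_list(batch) do
    # Keep only the final timestep of every sequence in the batch.
    Enum.map(batch, &List.last/1)
  end
end
```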

build_conformer_block(input, opts)

@spec build_conformer_block(
  Axon.t(),
  keyword()
) :: Axon.t()

Build a single Conformer block with the Macaron structure.
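
The depthwise convolution inside the block filters each channel independently along the time axis. The sketch below shows this on plain lists. It assumes zero "same" padding, so that an odd kernel such as the default 31 leaves the sequence length unchanged. `DepthwiseConvSketch` and its functions are hypothetical, and the padding mode Edifice actually uses is not stated in this documentation.

```elixir
defmodule DepthwiseConvSketch do
  # Hypothetical illustration; not part of Edifice.

  # `signal` is one channel over time; `kernel` is an odd-length filter.
  # Zero padding of (k - 1) / 2 on each side keeps the output length equal
  # to the input length (assumed "same" padding).
  def conv_same(signal, kernel) when rem(length(kernel), 2) == 1 do
    pad = div(length(kernel) - 1, 2)
    zeros = List.duplicate(0.0, pad)
    padded = zeros ++ signal ++ zeros

    padded
    |> Enum.chunk_every(length(kernel), 1, :discard)
    |> Enum.map(fn window ->
      window |> Enum.zip_with(kernel, &(&1 * &2)) |> Enum.sum()
    end)
  end

  # Depthwise: every channel gets its own kernel; channels are never mixed.
  def depthwise(channels, kernels) do
    Enum.zip_with(channels, kernels, &conv_same/2)
  end
end
```

Mixing information across channels is left to the pointwise projections on either side of the depthwise step.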

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of a Conformer model.