Edifice.Attention.Conformer (Edifice v0.2.0)

Conformer: convolution-augmented transformer for audio/speech processing.

The Conformer combines self-attention, which captures global (long-range) dependencies, with convolution, which captures local patterns. It uses a Macaron-style block in which two half-step feed-forward modules sandwich the attention and convolution modules.
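In the notation of Gulati et al. (2020), one block maps its input x to its output y as follows. FFN_1, FFN_2, MHSA and Conv stand for the modules in the architecture diagram, and each one applies its own LayerNorm to its input first. This restates the paper's formulation as a reading aid. It is not a claim about Edifice internals beyond what the diagram shows.

```latex
\begin{aligned}
\tilde{x} &= x + \tfrac{1}{2}\,\mathrm{FFN}_1(x) \\
x' &= \tilde{x} + \mathrm{MHSA}(\tilde{x}) \\
x'' &= x' + \mathrm{Conv}(x') \\
y &= \mathrm{LayerNorm}\!\left(x'' + \tfrac{1}{2}\,\mathrm{FFN}_2(x'')\right)
\end{aligned}
```

The factor 1/2 on both feed-forward residuals is what makes them "half-step". Together the two halves contribute the same total weight as the single full FFN in a standard transformer layer.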

Architecture (Macaron Block)

Input [batch, seq_len, hidden_size]
      |
+------------------------------------------------+
|   Conformer Block (x num_layers)               |
|                                                |
|   1. Half-FFN: norm -> FFN -> scale(0.5)       |
|      -> residual                               |
|   2. MHSA: norm -> self_attention -> residual  |
|   3. Conv module:                              |
|      norm -> pointwise_up -> GLU               |
|      -> depthwise_conv -> norm -> act          |
|      -> pointwise_down -> residual             |
|   4. Half-FFN: norm -> FFN -> scale(0.5)       |
|      -> residual                               |
|   5. Final LayerNorm                           |
+------------------------------------------------+
      |
Final LayerNorm
      |
Last timestep -> [batch, hidden_size]
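Two steps of the convolution module are less familiar than the rest: GLU gating and the depthwise convolution. Both can be sketched in plain Elixir. This is a hypothetical illustration, not Edifice code. The module name `ConvModuleSketch` is made up, sequences are nested lists shaped [seq_len][channels] rather than Nx tensors, and the sketch assumes odd kernel lengths with zero "same" padding. The surrounding norm, activation and pointwise projections are omitted.

```elixir
defmodule ConvModuleSketch do
  # Hypothetical sketch, not Edifice code. A sequence is a list of timesteps,
  # each timestep a list of per-channel floats: [seq_len][channels].

  # GLU: split the 2*C channels (e.g. produced by pointwise_up) into halves
  # a and b, then gate elementwise: a * sigmoid(b). Output has C channels.
  def glu(row) when rem(length(row), 2) == 0 do
    {a, b} = Enum.split(row, div(length(row), 2))
    Enum.zip_with(a, b, fn x, g -> x * sigmoid(g) end)
  end

  defp sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))

  # Depthwise 1-D convolution: channel c has its own kernel (assumed odd
  # length k) and only ever sees its own values. Zero padding of (k - 1) / 2
  # on each side keeps the output length equal to seq_len.
  def depthwise_conv(seq, kernels) do
    seq_len = length(seq)
    k = kernels |> hd() |> length()
    pad = div(k - 1, 2)
    frames = List.to_tuple(seq)

    for t <- 0..(seq_len - 1)//1 do
      kernels
      |> Enum.with_index()
      |> Enum.map(fn {kernel, c} ->
        kernel
        |> Enum.with_index()
        |> Enum.reduce(0.0, fn {w, j}, acc ->
          src = t + j - pad

          # Out-of-range timesteps act as zero padding.
          if src >= 0 and src < seq_len do
            acc + w * Enum.at(elem(frames, src), c)
          else
            acc
          end
        end)
      end)
    end
  end
end
```

With `conv_kernel_size: 31`, as in the Usage section, the padding is 15, so each output timestep sees 15 frames on either side. Because each channel is convolved only with itself, the depthwise step costs O(seq_len · channels · k). A full convolution would cost O(seq_len · channels² · k).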

Usage

alias Edifice.Attention.Conformer

model = Conformer.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  conv_kernel_size: 31,
  num_layers: 4
)
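The options above imply a few derived sizes. The hypothetical helper below (`ConformerShapes.derive/1` is not part of Edifice) computes them under two stated assumptions. First, the attention heads split `hidden_size` evenly, as in standard multi-head attention. Second, the depthwise convolution uses symmetric "same" padding, which requires an odd kernel size.

```elixir
defmodule ConformerShapes do
  # Hypothetical helper, not an Edifice API. Assumes heads split hidden_size
  # evenly and the depthwise conv uses symmetric "same" padding.
  def derive(opts) do
    hidden = Keyword.fetch!(opts, :hidden_size)
    heads = Keyword.fetch!(opts, :num_heads)
    kernel = Keyword.fetch!(opts, :conv_kernel_size)

    if rem(hidden, heads) != 0 do
      raise ArgumentError, "hidden_size must be divisible by num_heads"
    end

    if rem(kernel, 2) == 0 do
      raise ArgumentError, "this sketch assumes an odd conv_kernel_size"
    end

    %{head_dim: div(hidden, heads), conv_padding: div(kernel - 1, 2)}
  end
end
```

For the usage example, this gives 256 / 4 = 64 dimensions per head and 15 frames of padding on each side of the depthwise convolution.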

References

"Conformer: Convolution-augmented Transformer for Speech Recognition" (Gulati et al., 2020)

Summary

Types

build_opt()

Options for build/1.

Functions

build(opts \\ [])

Build a Conformer model.

build_conformer_block(input, opts)

Build a single Conformer block with the Macaron structure.
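The Macaron residual wiring of one block, following steps 1 to 5 of the architecture diagram, can be sketched in plain Elixir. This is a hypothetical illustration, not Edifice's implementation. `MacaronSketch` is a made-up name. It uses nested lists instead of Nx tensors and treats each sub-module as an opaque, caller-supplied function from a sequence to a same-shaped sequence. The LayerNorm here has no learnable gain or bias.

```elixir
defmodule MacaronSketch do
  # Hypothetical sketch of the Macaron residual structure only; not Edifice
  # code. A sequence is a list of timesteps, each a list of floats. `mods` is
  # a map of sub-modules %{ffn1:, mhsa:, conv:, ffn2:}, each a function from
  # a sequence to a sequence of the same shape.
  @eps 1.0e-5

  def block(x, mods) do
    x = add(x, scale(mods.ffn1.(norm(x)), 0.5)) # 1. half-step FFN + residual
    x = add(x, mods.mhsa.(norm(x)))             # 2. self-attention + residual
    x = add(x, mods.conv.(norm(x)))             # 3. conv module + residual
    x = add(x, scale(mods.ffn2.(norm(x)), 0.5)) # 4. half-step FFN + residual
    norm(x)                                     # 5. final LayerNorm
  end

  # LayerNorm over the feature dimension of each timestep (no gain/bias).
  def norm(seq), do: Enum.map(seq, &layer_norm_row/1)

  defp layer_norm_row(row) do
    n = length(row)
    mean = Enum.sum(row) / n
    var = Enum.reduce(row, 0.0, fn v, acc -> acc + (v - mean) * (v - mean) end) / n
    Enum.map(row, fn v -> (v - mean) / :math.sqrt(var + @eps) end)
  end

  defp add(a, b), do: Enum.zip_with(a, b, fn ra, rb -> Enum.zip_with(ra, rb, &+/2) end)

  defp scale(seq, s), do: Enum.map(seq, fn row -> Enum.map(row, &(&1 * s)) end)
end
```

Every sub-module sits on a residual branch with pre-normalization. If all four branches output zeros, the block therefore reduces to the final LayerNorm of its input, which is a quick sanity check on the wiring.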

output_size(opts \\ [])

Get the output size of a Conformer model.