NxAudio.Transforms.MelSpectrogram (nx_audio v0.2.0)

Implements mel-scaled spectrograms, a perceptually motivated time-frequency representation of audio.

The mel spectrogram applies mel filterbanks to a regular spectrogram, mapping linear frequency bins to the mel scale that better approximates human auditory perception.
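As a rough illustration of what "applying a filterbank" means, the sketch below computes each mel band as a weighted sum of the linear frequency bins of one frame. It uses plain lists instead of Nx tensors, and the module name, function name, and toy data are hypothetical; it is not the NxAudio implementation.

```elixir
defmodule ApplyFilterbankSketch do
  # Hypothetical helper for illustration only, not part of NxAudio.
  #
  # spectrogram: list of frames, each a list of n_freqs power values
  # filterbank:  list of n_mels filters, each a list of n_freqs weights
  # Returns a list of frames, each a list of n_mels band energies.
  def apply_filterbank(spectrogram, filterbank) do
    for frame <- spectrogram do
      for filter <- filterbank do
        # Weighted sum of the frame's bins under this filter.
        frame
        |> Enum.zip(filter)
        |> Enum.reduce(0.0, fn {power, weight}, acc -> acc + power * weight end)
      end
    end
  end
end

# Toy data: 2 frames with 4 frequency bins, reduced to 2 mel bands.
spectrogram = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
filterbank = [[1.0, 0.5, 0.0, 0.0], [0.0, 0.5, 1.0, 0.5]]

IO.inspect(ApplyFilterbankSketch.apply_filterbank(spectrogram, filterbank))
# => [[2.0, 6.0], [0.5, 1.0]]
```

The result has one row per time frame and one column per mel band, which corresponds to the [time, n_mels] layout for single-channel input.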

Mel Scale

The mel scale relates perceived frequency to actual frequency in Hz. Two main formulas are supported:

HTK (default): $$ m = 2595 \log_{10}(1 + \frac{f}{700}) $$

Slaney: $$ m = \begin{cases} f/f_{min} \cdot m_{min} & f < f_{min} \\ m_{min} + \log(f/f_{min})/\text{step} & f \geq f_{min} \end{cases} $$

where $f_{min}=1000$ Hz, $m_{min}=15$, $\text{step}=\log(6.4)/27$, and $\log$ is the natural logarithm.
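The two conversions can be sketched in plain Elixir as follows. This is a minimal sketch using only `:math`, not the library's code: the module and function names are hypothetical, and the Slaney constants are assumed to be the commonly cited Auditory Toolbox values (1000 Hz maps to 15 mel), so adjust them if the library differs.

```elixir
defmodule MelScaleSketch do
  # Hypothetical helper for illustration only, not part of NxAudio.

  # Slaney constants (assumed values, see the note above).
  @f_min 1000.0
  @m_min 15.0
  @step :math.log(6.4) / 27.0

  # Hz -> mel
  def hz_to_mel(f, :htk), do: 2595.0 * :math.log10(1.0 + f / 700.0)
  # Slaney: linear below f_min, logarithmic at and above it.
  def hz_to_mel(f, :slaney) when f < @f_min, do: f / @f_min * @m_min
  def hz_to_mel(f, :slaney), do: @m_min + :math.log(f / @f_min) / @step

  # mel -> Hz (inverse of the functions above)
  def mel_to_hz(m, :htk), do: 700.0 * (:math.pow(10.0, m / 2595.0) - 1.0)
  def mel_to_hz(m, :slaney) when m < @m_min, do: m / @m_min * @f_min
  def mel_to_hz(m, :slaney), do: @f_min * :math.exp((m - @m_min) * @step)
end

# The HTK scale puts 1000 Hz very close to 1000 mel.
IO.inspect(MelScaleSketch.hz_to_mel(1000.0, :htk))
# With the assumed constants, the Slaney scale is linear below 1000 Hz.
IO.inspect(MelScaleSketch.hz_to_mel(500.0, :slaney))
# => 7.5
```

Both branches of the Slaney formula agree at $f = f_{min}$, so the scale is continuous where it switches from linear to logarithmic.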

Filterbank Construction

Mel filterbanks are triangular overlapping windows spaced uniformly on the mel scale:

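  1. Convert the lower and upper frequency limits to the mel scale
  2. Create n_mels + 2 points evenly spaced on the mel scale between those limits
  3. Convert the points back to Hz; the n_mels interior points are the filter center frequencies, and each filter's edges are its two neighboring points
  4. Create triangular filters: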

$$ H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \frac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \leq k < f(m) \\ \frac{f(m+1) - k}{f(m+1) - f(m)} & f(m) \leq k < f(m+1) \\ 0 & k \geq f(m+1) \end{cases} $$

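where $k$ is the frequency of a spectrogram bin and $f(m)$ is the center frequency of filter $m$; $f(m-1)$ and $f(m+1)$ are the centers of the neighboring filters and mark where filter $m$ falls to zero.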
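The construction above can be sketched with plain lists. This is an illustrative sketch under stated assumptions, not the NxAudio implementation: it uses the HTK formula only, takes the bin frequencies in Hz as an explicit list, and the module name, function names, and parameters (`freqs`, `f_low`, `f_high`) are hypothetical.

```elixir
defmodule MelFilterbankSketch do
  # Hypothetical helper for illustration only, not part of NxAudio.

  # HTK mel scale and its inverse.
  def hz_to_mel(f), do: 2595.0 * :math.log10(1.0 + f / 700.0)
  def mel_to_hz(m), do: 700.0 * (:math.pow(10.0, m / 2595.0) - 1.0)

  # freqs: frequency in Hz of each spectrogram bin
  # Returns n_mels filters, each a list with one weight per entry of freqs.
  def filterbank(freqs, n_mels, f_low, f_high) do
    # Step 1: convert the frequency limits to mel.
    mel_low = hz_to_mel(f_low)
    mel_high = hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_mels + 1)

    # Steps 2 and 3: n_mels + 2 evenly spaced mel points, converted back to Hz.
    points = for i <- 0..(n_mels + 1), do: mel_to_hz(mel_low + i * step)

    # Step 4: every run of three consecutive points (left, center, right)
    # defines one triangular filter.
    points
    |> Enum.chunk_every(3, 1, :discard)
    |> Enum.map(fn [left, center, right] ->
      Enum.map(freqs, &triangle(&1, left, center, right))
    end)
  end

  # Piecewise definition of H_m(k).
  defp triangle(k, left, _center, _right) when k < left, do: 0.0
  defp triangle(k, left, center, _right) when k < center, do: (k - left) / (center - left)
  defp triangle(k, _left, center, right) when k < right, do: (right - k) / (right - center)
  defp triangle(_k, _left, _center, _right), do: 0.0
end

# 65 bins covering 0 to 4000 Hz, reduced to 4 mel bands.
freqs = for i <- 0..64, do: i * 62.5
filters = MelFilterbankSketch.filterbank(freqs, 4, 0.0, 4000.0)
IO.inspect(length(filters))
# => 4
```

Because the points are evenly spaced in mel rather than in Hz, the filters get wider toward higher frequencies.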

Applications

Mel spectrograms are widely used in:

  • Speech recognition
  • Music information retrieval
  • Audio classification
  • Sound event detection
  • Speaker identification

By mapping frequencies to a perceptual scale and reducing the many linear frequency bins of a spectrogram to n_mels bands, mel spectrograms provide a compact audio representation that emphasizes perceptually relevant detail.

Functions

transform(audio_tensor, opts \\ [])

Computes the mel-scaled spectrogram of an audio signal.

Args:

  • audio_tensor: Input audio tensor of shape [samples] or [channels, samples]
  • opts: MelSpectrogram configuration options (see NxAudio.Transforms.MelSpectrogramConfig)

Returns:

  • If input is [samples]: tensor of shape [time, n_mels]
  • If input is [channels, samples]: tensor of shape [channels, time, n_mels]