Membrane.RTP.Vad.IsSpeakingEstimator (Membrane RTP plugin v0.28.0)

Module for estimating whether the user is speaking, inspired by "Dominant Speaker Identification for Multipoint Videoconferencing" by Ilana Volfin and Israel Cohen.


The estimate_is_speaking/2 function takes a list of audio levels in the range [0, 127] and, based on a threshold given as the second argument, determines whether the person is speaking.

The input levels are interpreted on 3 tiers. Each tier consists of intervals specified below:

| Name | Interpretation | Number of RTP packets (default) | Length |
|---|---|---|---|
| immediate | smallest sound chunk | 1 | ~20 [ms] |
| medium | one word | 1 * 10 = 10 | ~200 [ms] |
| long | half/one sentence | 1 * 10 * 7 = 70 | ~1.4 [s] |

Each tier interval is computed based on the smaller tier intervals (subunits). Immediates are computed based on levels, mediums on top of immediates and longs on top of mediums. The number of subunits in one interval is given as a module parameter.

The value of each interval is the number of its active subunits, that is, the lower-tier intervals whose value is at or above the subunit threshold of the given tier.

Example

If level_threshold is 90, levels are [80, 90, 100, 90] and there are 2 levels in an immediate, then immediates are equal to [1, 2], since the subunit [80, 90] has 1 item above or equal to the threshold and the subunit [100, 90] has 2 such items.

The same goes for mediums. If the medium subunit threshold is 2 and the number of subunits is 2, then mediums are equal to [1], since the only subunit, [1, 2], has only one element above or equal to the threshold.
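The counting described in this example can be sketched as follows. This is only an illustration: the TierSketch module and its count_active/3 function are hypothetical names, not part of the plugin's API, and dropping an incomplete trailing chunk is an assumption.

```elixir
defmodule TierSketch do
  # Hypothetical helper, not part of Membrane.RTP.Vad.IsSpeakingEstimator.
  # Splits `values` into chunks of `subunits` items and, for each chunk,
  # counts the items that are at or above `threshold`.
  # Assumption: an incomplete trailing chunk is discarded.
  def count_active(values, subunits, threshold) do
    values
    |> Enum.chunk_every(subunits, subunits, :discard)
    |> Enum.map(fn chunk -> Enum.count(chunk, &(&1 >= threshold)) end)
  end
end

# Levels -> immediates (2 levels per immediate, level_threshold of 90)
immediates = TierSketch.count_active([80, 90, 100, 90], 2, 90)
# immediates == [1, 2]

# Immediates -> mediums (2 immediates per medium, subunit threshold of 2)
mediums = TierSketch.count_active(immediates, 2, 2)
# mediums == [1]
```

The same function serves every tier, because each tier only counts how many of its subunits reach that tier's threshold.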

The number of levels the function requires equals the product of the subunit counts of all tiers (1 * 10 * 7 = 70 with the default parameters). This way only one long interval is computed, because only one is needed. If the number of levels is smaller than the required minimum, the algorithm returns silence.
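As a rough sketch of that requirement, the minimum number of levels can be derived from the per-tier subunits values. The LevelCountSketch module and its functions are hypothetical names used for illustration only; the params map simply mirrors the shape of the configuration map, reduced to the subunits keys.

```elixir
defmodule LevelCountSketch do
  # Hypothetical helper, not part of the plugin's API.
  # Multiplies the `:subunits` value of every tier.
  def required_levels(params) do
    params
    |> Map.values()
    |> Enum.reduce(1, fn tier, acc -> acc * tier.subunits end)
  end

  # True when `levels` is long enough to build one long interval.
  def enough_levels?(levels, params) do
    length(levels) >= required_levels(params)
  end
end

params = %{immediate: %{subunits: 1}, medium: %{subunits: 10}, long: %{subunits: 7}}

required = LevelCountSketch.required_levels(params)
# required == 70

LevelCountSketch.enough_levels?(List.duplicate(100, 69), params)
# false, so the estimator would report silence for such an input
```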

The most recent interval in each tier serves as the basis for computing an activity score. The activity score is the logarithm of the quotient of:

  • the probability of k active items in n total items under the assumption that the person is speaking (modeled as a binomial distribution)
  • the probability of the same outcome under the assumption that the person is not speaking (modeled as an exponential distribution)
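One possible reading of this log-likelihood ratio is sketched below. It is not the plugin's implementation: the ActivityScoreSketch module is a hypothetical name, the binomial success probability p = 0.5 is an assumption (the text above does not give it), and evaluating the exponential density lambda * exp(-lambda * k) directly at k is also an assumption. The plugin's exact formula may differ, for example in how k is normalised or clamped.

```elixir
defmodule ActivityScoreSketch do
  # Hypothetical sketch, not the plugin's implementation.

  # log(n choose k), computed as a sum of logs to avoid large factorials.
  defp log_binomial_coefficient(_n, 0), do: 0.0

  defp log_binomial_coefficient(n, k) do
    Enum.reduce(1..k, 0.0, fn i, acc ->
      acc + :math.log(n - k + i) - :math.log(i)
    end)
  end

  # k: active subunits, n: total subunits, lambda: exponential parameter.
  # Assumption: p = 0.5 is the per-subunit activity probability under speech.
  def score(k, n, lambda, p \\ 0.5) do
    # log of the binomial probability of k successes out of n
    log_speech =
      log_binomial_coefficient(n, k) + k * :math.log(p) + (n - k) * :math.log(1 - p)

    # log of the exponential density lambda * exp(-lambda * k)
    log_silence = :math.log(lambda) - lambda * k

    log_speech - log_silence
  end
end

ActivityScoreSketch.score(1, 1, 1)
# ≈ 0.307 (log(0.5) + 1)
ActivityScoreSketch.score(0, 1, 1)
# ≈ -0.693 (log(0.5))
```

Under these assumptions the score grows with the number of active subunits k, which is why a higher score is taken as stronger evidence of speech.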

The activity score for each tier is then thresholded again. A threshold for every tier is given as a module parameter. If every activity score is equal to or above its tier's threshold, the algorithm returns that the input contains speech. Otherwise, it returns that the input contains silence.
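The final decision can be sketched as a check over all tiers. The DecisionSketch module is a hypothetical name, and the score values below are made up purely for illustration; only the thresholds are the default score_threshold values. The comparison uses "equal or above", matching the description of score_threshold.

```elixir
defmodule DecisionSketch do
  # Hypothetical helper, not part of the plugin's API.
  # `scores` and `thresholds` are maps keyed by tier name.
  def decide(scores, thresholds) do
    all_pass? =
      Enum.all?(thresholds, fn {tier, threshold} ->
        Map.fetch!(scores, tier) >= threshold
      end)

    if all_pass?, do: :speech, else: :silence
  end
end

thresholds = %{immediate: 0, medium: 20, long: 20}

# Illustrative (made-up) scores: every tier reaches its threshold.
DecisionSketch.decide(%{immediate: 0.3, medium: 41.7, long: 25.0}, thresholds)
# :speech

# The medium tier falls short, so the overall result is silence.
DecisionSketch.decide(%{immediate: 0.3, medium: 16.2, long: 25.0}, thresholds)
# :silence
```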


Overall the parameters for each tier are:

| Parameter | Description |
|---|---|
| subunits | number of smaller units in one bigger unit |
| score_threshold | minimum value (inclusive) that the activity score of the given tier must reach to be counted as indicating speech |
| subunit_threshold | minimum value (inclusive) that a subunit of the given tier must reach to be counted as active (for immediates it is equal to the threshold given as the second argument of estimate_is_speaking/2) |
| lambda | parameter of the exponential distribution (used in the activity score computation) |

You can set them by adding the following code to your config.exs:

config :membrane_rtp_plugin,
  vad_estimation_parameters: %{
    immediate: %{
      subunits: 1,
      score_threshold: 0,
      lambda: 1
    },
    medium: %{
      subunits: 10,
      score_threshold: 20,
      subunit_threshold: 1,
      lambda: 24
    },
    long: %{
      subunits: 7,
      score_threshold: 20,
      subunit_threshold: 3,
      lambda: 47
    }
  }

A thorough explanation with images can be found in the Jellyfish book in the Voice Activity Detection chapter.

Functions

estimate_is_speaking(levels, level_threshold)

@spec estimate_is_speaking([integer()], integer()) :: :speech | :silence

Estimates if the user is speaking based on the audio levels and a threshold.