About silence and talk over events

NOTE   This feature is supported in an Avaya Aura Contact Center with MLS environment. However, it is not supported in an Avaya Aura Contact Manager environment due to a limitation with Avaya Recording.

A recorded call contains two streams of audio that represent the two sides of the call. In the Media Player, the Audio panel displays the inbound stream in blue and the outbound stream in red. In a normal conversation, the energy alternates between the inbound stream and the outbound stream.

When the energy in the inbound stream and the outbound stream spikes at the same time, that is a talk over event. The Audio panel displays a Talk Over icon in the energy bar where a talk over event occurs. When both parties are silent during a call, that is a silence event. During a silence event, the line in the energy bar is flat. The Audio panel displays a Silence icon in the energy bar where a silence event occurs.

Normally, each stream contains the voice of a single person: either the agent or the customer. Occasionally, a stream includes multiple voices. For example, a conference call contains the agent stream where you hear the agent’s voice and a second stream where you hear the voices of all other parties in the conference call.

Calls can include non-speech noises (for example, wind, typing, background conversations, or barking dogs). Calabrio ONE processes these noises in addition to speech when searching for silence and talk over events in a call. Brief background noises might display as audio energy, but Calabrio ONE still considers those periods to be silence.

Calabrio ONE uses a Voice Activity Detection (VAD) module to classify audio as silence or speech. VAD is designed to analyze phone calls where you expect to hear two or more people talking to each other. VAD divides the audio data into separate blocks, called frames, and calculates an average sound volume for each block. (A frame size is measured in milliseconds of audio. VAD uses the same frame size when processing all audio in a file.)

VAD uses its decision threshold to determine whether each frame contains silence or speech. If the average volume for a frame falls below the VAD decision threshold, VAD marks that frame as silence. VAD processes each frame of each stream, compares the frames from stream 1 and stream 2, and assigns an audio type to each pair of frames, as illustrated in the sketch after this list. The audio types are as follows:

  • Mutual Silence (MS)—Both frames are silent.
  • Normal (N)—One frame contains speech, and the other frame is silent. This indicates normal conversation.
  • Talk Over (TO)—Both frames contain speech.
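
The following sketch shows this per-frame pairing in Python. It is a simplified illustration only; the frame size, sample rate, fixed threshold, and function names are assumptions and do not represent Calabrio ONE's actual VAD implementation.

    # A minimal sketch only. The frame size, sample rate, and function names are
    # hypothetical and do not reflect Calabrio ONE's actual VAD implementation.
    from statistics import mean

    FRAME_MS = 20                               # assumed frame size in milliseconds
    SAMPLE_RATE = 8000                          # assumed telephony sample rate
    FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # samples per frame

    def frame_volumes(samples):
        """Split one audio stream into frames and return each frame's average volume."""
        return [
            mean(abs(s) for s in samples[i:i + FRAME_LEN])
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)
        ]

    def classify_frame_pair(vol_in, vol_out, threshold):
        """Compare one frame from each stream against the decision threshold."""
        in_speech = vol_in >= threshold
        out_speech = vol_out >= threshold
        if in_speech and out_speech:
            return "TO"    # Talk Over: both frames contain speech
        if in_speech or out_speech:
            return "N"     # Normal: one frame contains speech, the other is silent
        return "MS"        # Mutual Silence: both frames are silent

    def classify_call(inbound, outbound, threshold):
        """Assign an audio type to each pair of frames from the two streams."""
        return [
            classify_frame_pair(v_in, v_out, threshold)
            for v_in, v_out in zip(frame_volumes(inbound), frame_volumes(outbound))
        ]

A fixed threshold is passed in here to keep the sketch small. As described below, VAD actually adjusts its decision threshold based on the quality of the audio.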

VAD uses a heuristic algorithm that adapts based on the quality of the audio data. In a noisy environment, the VAD decision threshold rises so that only the loudest sounds are marked as speech. Otherwise, the entire phone conversation would be marked as constant speech, even when the noise comes from a car engine or another form of non-speech background noise. In a quiet environment where a person is speaking softly, the VAD decision threshold falls so that it can correctly identify speech at a low volume. This allows the quiet speech to be marked as normal speech instead of silence.
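
As a rough illustration of that adaptive behavior, the following sketch tracks a running noise-floor estimate and keeps the decision threshold a margin above it. The smoothing factor, margin, and update rules are assumptions chosen for illustration; they are not Calabrio ONE's actual heuristic.

    # A minimal sketch of an adaptive decision threshold. The smoothing factor,
    # margin, and update rules are assumptions, not Calabrio ONE's actual heuristic.
    def classify_with_adaptive_threshold(volumes, alpha=0.05, margin=2.0, initial_floor=100.0):
        """Track a running noise-floor estimate and mark each frame as speech or silence."""
        floor = initial_floor
        decisions = []
        for vol in volumes:
            threshold = floor * margin          # keep the threshold a margin above the noise floor
            decisions.append("speech" if vol >= threshold else "silence")
            if vol < threshold:
                # Quiet frames pull the noise-floor estimate (and the threshold) down,
                # so quiet speech in a quiet call can still rise above the threshold.
                floor = (1 - alpha) * floor + alpha * vol
            else:
                # Loud frames pull the estimate up only slowly, so sustained background
                # noise (a car engine, for example) eventually raises the threshold
                # without letting a single loud word do so.
                slow = alpha / 10
                floor = (1 - slow) * floor + slow * vol
        return decisions

With rules like these, sustained background noise gradually raises the threshold, while a consistently quiet call lowers it, which mirrors the adaptive behavior described above.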

This adaptability allows VAD to be more accurate when detecting speech or silence, but it is not always 100% accurate. Because VAD uses average sound volume to tell the difference between speech and silence, there will always be instances where it incorrectly identifies normal speech or mutual silence in a frame of audio. When background noise levels change, VAD needs a few seconds to adapt. During this time, it might mark audio as normal speech when no one is speaking, or it might mark mutual silence when someone is speaking. During mutual silence, for example, a sudden noise like typing on a keyboard or a cough might be loud enough to cause VAD to identify a frame as speech even though no one is speaking. Essentially, VAD does not know the difference between human speech and the sound of a car engine.

It is also possible that VAD might not identify a talk over or silence event. For example, it might miss a talk over event even when two people are clearly talking to each other on a call at the same time. If one of the speakers pauses to think or take a breath for at least a quarter of a second during the talk over event, VAD could mark those frames as silence. From the speakers’ perspective, they were talking constantly, so you would expect VAD to indicate a talk over event. From VAD’s perspective, however, there was a period of silence during the conversation, so it cannot be considered a single, continuous talk over event.

On the Application Management > QM > QM Configuration > Global Settings page, you can establish the minimum duration of silence or talk over to be considered an event. For each event, Calabrio ONE saves the type (silence or talk over), the duration of the event in milliseconds, and the start of the event as an offset from the beginning of the audio.
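
As a rough sketch of how those settings and stored fields fit together, the following example groups consecutive frame classifications into events, discards events that are shorter than the configured minimums, and records each remaining event. The grouping logic and names are assumptions for illustration; only the stored fields (type, duration in milliseconds, and offset from the beginning of the audio) come from the description above.

    # A minimal sketch only. The grouping logic and field names are hypothetical;
    # only the stored fields (type, duration, offset) come from the description above.
    from itertools import groupby

    FRAME_MS = 20  # assumed frame size, matching the earlier sketch

    def extract_events(frame_types, min_silence_ms, min_talk_over_ms):
        """Group consecutive MS or TO frames into events that meet the minimum duration."""
        minimums = {"MS": min_silence_ms, "TO": min_talk_over_ms}
        events = []
        offset_ms = 0
        for audio_type, run in groupby(frame_types):
            duration_ms = len(list(run)) * FRAME_MS
            if audio_type in minimums and duration_ms >= minimums[audio_type]:
                events.append({
                    "type": "silence" if audio_type == "MS" else "talk over",
                    "duration_ms": duration_ms,   # duration of the event in milliseconds
                    "offset_ms": offset_ms,       # start of the event from the beginning of the audio
                })
            offset_ms += duration_ms
        return events

Note that a run of talk over frames ends as soon as a different classification appears. Even a brief pause therefore splits what a listener would consider one talk over event into shorter runs, each of which might fall below the configured minimum, which is the behavior described earlier in this topic.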