Digital audio compression and coding standards

Traditionally, audio recording systems have used objective parameters as their design goals - flat response, minimal noise, and so on.

Perceptual coders recognize that the final receiver is the human auditory system and exploit its properties when coding audio signals.

We only care about what we can hear.

Physiology of the human ear

Two fundamental phenomena govern human hearing: the minimum hearing threshold and masking.

  • The ear is most sensitive around 1 to 5 kHz.
  • The threshold of hearing curve describes the minimum level at which the ear can detect a tone at a given frequency.
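The threshold curve can be approximated in closed form. The sketch below uses Terhardt's well-known approximation; the exact curve varies between listeners, so treat the numbers as indicative only:

```python
import math

def threshold_of_hearing_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing (dB SPL).

    Valid roughly over 20 Hz - 20 kHz; the minimum of the curve falls in
    the 1-5 kHz region, where the ear is most sensitive.
    """
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# Far more level is needed to detect a 100 Hz tone than a 3 kHz tone:
print(threshold_of_hearing_db(100))   # high threshold at low frequency
print(threshold_of_hearing_db(3000))  # near the minimum of the curve
```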

Amplitude masking

Amplitude masking occurs in the frequency domain.

A loud sound masks other, nearby frequencies.

  • Amplitude masking occurs when a tone shifts the threshold curve upward in a frequency region surrounding the tone.

  • Amplitude masking is also referred to as simultaneous masking as it occurs when tones are sounded simultaneously.

  • The strong sound is called the masker and the softer sound is called the maskee.

  • The masking threshold describes the level where a tone is barely audible because of the existence of maskers.

Anything below this surface cannot be heard and hence will not be coded.

Note that masking extends further to the right, i.e. toward higher frequencies.

  • Simultaneous masking curves are asymmetrical in a way that the slope of the shifted curve is less steep on the high-frequency side.
  • As sound level of the masker increases, the threshold curve broadens, and in particular its upper slope decreases while lower slope remains relatively unaffected.
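The asymmetry can be sketched with a toy triangular spreading function in the Bark domain. The slope values here are illustrative assumptions (in real psychoacoustic models the upper slope also depends on the masker level, as noted above):

```python
def masking_curve_db(masker_bark, masker_level_db, z_bark,
                     lower_slope=25.0, upper_slope=10.0):
    """Toy triangular spreading function (dB) around a single masker.

    Slopes are in dB per Bark and are illustrative only: the skirt on
    the high-frequency side (upper_slope) is shallower than the one on
    the low-frequency side, so masking extends further upward.
    """
    dz = z_bark - masker_bark
    if dz < 0:
        return masker_level_db + lower_slope * dz  # steep low-frequency skirt
    return masker_level_db - upper_slope * dz      # shallow high-frequency skirt

# An 80 dB masker at bark 10 masks more 2 barks above it than 2 barks below:
above = masking_curve_db(10, 80, 12)  # 80 - 10*2 = 60 dB
below = masking_curve_db(10, 80, 8)   # 80 - 25*2 = 30 dB
```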

Temporal masking

Temporal masking is in Time domain.

  • Temporal masking occurs when tones are sounded close in time, but not simultaneously.
  • A louder tone appearing just after a softer tone masks the softer tone. (premasking)
  • A louder tone appearing just before a softer tone masks the softer tone. (postmasking)

  • Temporal masking increases as time differences are reduced.
  • Temporal masking decreases as the duration of the masker decreases.
  • Simultaneous masking is stronger than either pre- or post- masking because the sounds occur at the same time.
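The first trend above (masking grows as the time gap shrinks) can be illustrated with a toy exponential-decay model. Both the decay shape and the time constant `tau_ms` are hypothetical choices for illustration, not measured psychoacoustic data:

```python
import math

def post_masking_db(masker_level_db, delay_ms, tau_ms=50.0):
    """Toy model: the amount of post-masking decays exponentially with
    the delay after the masker ends. Purely illustrative; real
    post-masking lasts on the order of 100-200 ms.
    """
    return masker_level_db * math.exp(-delay_ms / tau_ms)

# A maskee 10 ms after the masker is masked more than one 100 ms after:
print(post_masking_db(60, 10), ">", post_masking_db(60, 100))
```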

Combined effect of Amplitude masking and Temporal masking

Since masking depends on amplitude, frequency, and time, the combined effect is a 3-D surface.

Amplitude and temporal masking form a contour that can be mapped in the time-frequency domain.

  • Perceptual coders identify this contour for changing signal conditions, and code the signal appropriately.

Rationale for Perceptual Coding

Perceptual coding systems analyze the frequency and amplitude content of the input signal, compare it to a model of human auditory perception, and code it accordingly.

Tests show that compression ratios of 4:1 or 6:1 can be transparent (i.e., listeners cannot detect any change from the original).

Performance of perceptual coding

The performance of perceptual coding is based on the following factors:

  • Only audible information is coded.
  • Bits are assigned according to audibility.
  • Quantization error is confined to a critical band.

Audible information is coded according to its significance.

  • Perceptual coding is tolerant of errors.
    • With PCM, an error introduces broadband noise.
    • With most perceptual coders, the error is limited to a narrow band corresponding to the bandwidth of the coded critical band, thus limiting its loudness; it usually remains inaudible.

Coding Techniques

Audio coders operate over a block of samples.

How should we determine the block size?

  • A larger block size gives lower time resolution but higher frequency resolution.
  • An amplitude masking threshold with finer frequency resolution can therefore be obtained with a longer block.

  • At the same time, a block must be kept short to stay within the temporal resolution of the ear.
  • A good balance in block size is required.

Most coders overlap successive blocks in time by 50% or so to reduce blocking artifacts.

Subband coding:

  • Blocks of consecutive time-domain samples representing the broadband signal are collected over a short period and applied to a digital filter bank.
  • The filter bank divides the signal into multiple bandlimited channels to approximate the critical band response of the human ear.
    • Synthesis filter bank sums the subband signals to reconstruct the output broadband signal.
  • Frequency analysis (by DFT) guides the bit allocation
    • More important signals are assigned more bits

Each subband is coded independently.

  • A subband’s masking level is derived with a pre-defined psychoacoustic model.
    • The psychoacoustic model tells how a masker shifts the local masking curve!
  • Signals below the minimum threshold or masking curve are not coded, because they are inaudible.

In the figure, blue indicates an inaudible signal and green indicates an audible signal.

  • The signal-to-mask ratio (SMR) of a particular subband is:
    • the difference between the maximum signal and the masking threshold in that subband.
    • used to determine the number of bits assigned to a subband.
  • The number of bits given to any subband must be sufficient to yield a requantization noise level that is below the masking level.
  • The quantization noise in a subband
    • is limited to that subband and should be masked by the audio signal in that subband.
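The key relations above can be written out directly. The level values in the usage lines are hypothetical, chosen only to illustrate the definitions:

```python
def smr_db(peak_signal_db, masking_threshold_db):
    """Signal-to-mask ratio of a subband (dB): the difference between
    the peak signal level and the masking threshold in that subband."""
    return peak_signal_db - masking_threshold_db

def noise_is_masked(noise_floor_db, masking_threshold_db):
    """Quantization noise confined to a subband stays inaudible as long
    as its level is below that subband's masking threshold."""
    return noise_floor_db < masking_threshold_db

# Hypothetical subband with a 45 dB peak and an 18 dB masking threshold:
print(smr_db(45, 18))           # SMR = 27 dB
print(noise_is_masked(10, 18))  # noise at 10 dB lies below the mask
```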

The maximum acceptable quantization noise in each band is set by the lowest masking level in that band.

MPEG-1 Audio standard

The audio portion of the MPEG-1 standard (11172-3) has found many applications such as VCD, CD-ROM, and digital audio broadcasting.

It supports coding of 32, 44.1 and 48 kHz PCM data at bit rates of 32 to 192 kbps/channel.

The standard describes three layers of coding.

  • Layer I describes the least sophisticated method and operates at 192 kbps/channel.
  • Layer II is based on layer I and operates at 96-128 kbps/channel. (More compression)
  • Layer III is conceptually different from I and II, and operates at 64 kbps/channel.
    • known as MP3

Layers I and II are based on MUSICAM (Masking-pattern Universal Subband Integrated Coding And Multiplexing) coding algorithm.

Layer III is based on both MUSICAM and ASPEC (Adaptive Spectral Perceptual Entropy Coding).

Generally, MPEG-audio Encoder

The psychoacoustic model defines how the maskers (the arrows in the figure) shift the masking curve. (Masking Curve Derivation)

Components below the curve (marked in blue) are inaudible and need not be coded.

See below example for better understanding.

Example - Masking Curve Derivation and Bit Allocation

A perceptual audio codec is used to compress an audio signal. The codec groups every 4 barks into a subband and then allocates bits to different subbands according to the result of a spectrum analysis based on a psychoacoustic model. All samples in the same subband are quantized with the same quantizer, whose bit resolution is allocated by the codec. (The Bark scale is a psychoacoustical scale proposed by Eberhard Zwicker in 1961.)
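Zwicker's standard conversion maps frequency in Hz onto the Bark (critical-band rate) scale. The helper `subband_of` is a hypothetical convenience for this example's 4-barks-per-subband grouping:

```python
import math

def hz_to_bark(f_hz):
    """Zwicker's frequency-to-Bark conversion (critical-band rate)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

def subband_of(f_hz, barks_per_subband=4):
    """Hypothetical helper: 1-based subband index under the example's
    grouping of 4 barks per subband."""
    return int(hz_to_bark(f_hz) // barks_per_subband) + 1

# 1 kHz sits near 8.5 barks, i.e. in subband 3 under this grouping:
print(hz_to_bark(1000), subband_of(1000))
```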

(i) Locate the potential maskers.

Potential maskers are all the audible local maxima.

The blue ovals are also local maxima, but they lie below the hearing threshold, so they are not potential maskers.

The red ovals are the potential maskers.

Positions of the 7 potential maskers: barks 7, 11, 14, 15, 18, 21 and 23.

(ii) Based on the given psychoacoustic model, derive the masking threshold.

Fig. 1b gives the psychoacoustic model. Apply its masking curve at each masker.

(iii) Determine the Signal-to-Mask levels of each subband.

SMR in each subband = highest signal level − lowest masking level

Subband 1: 0 dB

Subband 2: 45 - 18 = 27 dB

Subband 3: 0 dB

Subband 4: 60 - 35 = 25 dB

Subband 5: 50 - 42 = 8 dB

Subband 6: 85 - 50 = 35 dB

Subband 7: 0 dB

Subband 8: 0 dB

(iv) Suppose allocating one additional bit to a subband results in a 6dB drop of the noise floor in that subband. Allocate an appropriate number of bits to all subbands.

Each additional bit lowers the noise floor by 6 dB, so the number of bits needed is

$bits = \left\lceil \frac{SMR_{dB}}{6\,\text{dB}} \right\rceil$

Subband 1: 0 bits

Subband 2: ⌈27 / 6⌉ = 5 bits

Subband 3: 0 bits

Subband 4: ⌈25 / 6⌉ = 5 bits

Subband 5: ⌈8 / 6⌉ = 2 bits

Subband 6: ⌈35 / 6⌉ = 6 bits

Subband 7: 0 bits

Subband 8: 0 bits
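The allocation above can be checked in a few lines, starting from the SMRs found in part (iii):

```python
import math

# SMRs from part (iii) of the example, in dB, for subbands 1..8:
smrs_db = [0, 27, 0, 25, 8, 35, 0, 0]

# Each extra bit drops the quantization noise floor by 6 dB, so a
# subband needs ceil(SMR / 6) bits to push the noise below the mask;
# subbands with no audible signal get zero bits.
bits = [math.ceil(s / 6) if s > 0 else 0 for s in smrs_db]
print(bits)  # [0, 5, 0, 5, 2, 6, 0, 0]
```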

Practical Example : Use a perceptual model in audio coding

MPEG audio coders

  • A higher layer makes better use of the psychoacoustic model, so a higher compression rate can be achieved.
  • The 3 layers require increasing levels of complexity (and hence cost) to achieve a particular perceived quality; the choice of layer and bit rate is often a compromise between the desired perceived quality and the available bit rate.

MPEG-1 Layer I

Layer I is a simplified version of the original MUSICAM standard.

  • A polyphase filter bank is used to split the wideband signal into 32 subbands of equal width.
  • Adjacent subbands overlap, and the filter bank and its inverse are not lossless.

SMR determines the minimum signal-to-noise ratio that has to be met by the quantization of the subband samples.

When available, additional bits are added to codewords to increase the S/N ratio above the minimum.

Overview: Encoder and Decoder

In the encoder:

  • In general, more bits will be allocated to subbands of higher SMRs.
  • Subbands judged inaudible are given a zero allocation.

Keypoints:

  • The wideband signal is split into 32 subbands with a polyphase filter.
  • For spectrum analysis, the FFT analysis block size is 512

MPEG-1 layer II

Layer II is essentially identical to the original MUSICAM standard. Layer II is similar to layer I, but more sophisticated in design.

The block diagram is exactly the same as for Layer I.

Keypoints:

  • The wideband signal is split into 32 subbands with a polyphase filter.
  • For spectrum analysis, the FFT analysis block size is increased to 1024
    • improves the frequency resolution of the masking curve
  • Tonal and nontonal (noise) components are distinguished to better determine their effect on the masking threshold.

MPEG-1 Layer III (MP3)

Layer III combines elements from MUSICAM and ASPEC, and is more complex than Layers I and II.

Techniques used:

  • A combined version of subband and transform coding (DCT) – higher spectral resolution
  • Adaptive window selection (long 36/short 12/long-short/short-long)
  • Mixed mode ( low band uses long window, high band uses short window)
  • Noise allocation

A combined version of subband and transform coding (DCT) – higher spectral resolution

Keypoints:

  • The wideband signal is split into 32 subbands with a polyphase filter.
    • Additionally, each subband is transformed into 18 spectral coefficients by a modified discrete cosine transform (MDCT) for a maximum of 576 coefficients, each representing a frequency band of equal width.
    • This provides good spectral resolution that is needed for steady state signals.
  • For spectrum analysis, the FFT analysis block size is 1024
    • improves the frequency resolution of the masking curve
  • Transform coding (DCT) and Huffman Coding used

Adaptive window selection (long 36/short 12/long-short/short-long)

  • Long window => Good frequency resolution, Poor temporal resolution
  • Short window => Poor frequency resolution, Good temporal resolution

Temporal resolution is

  • more critical than frequency resolution when analyzing a transient signal.
    • Use short windows (12 samples) for transient signals
  • less critical than frequency resolution when analyzing a steady signal.
    • Use long windows (36 samples) for steady signals
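A simplified sketch of the switching logic, assuming a per-block transient flag is already available. The long-short ("start") and short-long ("stop") windows from the list above bridge the two shapes so that overlapped blocks still reconstruct:

```python
def window_sequence(transient_flags):
    """Sketch of Layer III adaptive window switching.

    Steady blocks use the long (36-sample) window for frequency
    resolution; transient blocks use short (12-sample) windows for
    time resolution. 'start' and 'stop' are the long-short and
    short-long transition windows.
    """
    windows = []
    for i, transient in enumerate(transient_flags):
        if transient:
            windows.append("short")
        else:
            prev_t = i > 0 and transient_flags[i - 1]
            next_t = i + 1 < len(transient_flags) and transient_flags[i + 1]
            if next_t:
                windows.append("start")  # prepare for an upcoming transient
            elif prev_t:
                windows.append("stop")   # return to steady-state coding
            else:
                windows.append("long")
    return windows

print(window_sequence([False, False, True, False, False]))
# ['long', 'start', 'short', 'stop', 'long']
```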

Mixed mode ( low band uses long window, high band uses short window)

High frequency components change more quickly than low frequency components.

  • Temporal resolution is more critical than frequency resolution when handling high-frequency subbands.
    • Use short windows for high-frequency subbands
  • Temporal resolution is less critical than frequency resolution when handling low-frequency subbands.
    • Use long windows for low-frequency subbands

Noise allocation

A noise allocation iteration loop is used to calculate optimal quantization noise in each subband.

  • An analysis-by-synthesis method calculates a quantized spectrum that satisfies the noise requirements of the modeled masking threshold
  • This is referred to as noise allocation, as opposed to bit allocation.

The bit allocation is SMR-based, while the noise allocation is based on the actual noise power of the final outcome.

  • Huffman coding is used for both scale factors and coefficients.
  • The data rate varies from frame to frame; Layer III is thus a variable-rate coding algorithm.