Digital Audio Processing Fundamentals

Nature of sound and its engineering interpretation

What is Sound?

Sound is the vibration of air (or another medium such as water) sensed by the human ear.

  • Sound travels in air at a speed of around 330 m/s.
    • This is called the speed of sound; it is fast by everyday standards, though far slower than light.
  • Sound can also travel in other media, such as wood, stone, steel, and water, but NOT in a vacuum.
    • In outer space, it is therefore completely silent.

What is Air Pressure?

When something vibrates in the air, the air is disturbed, leading to changes in air pressure.

  • The changes in air pressure propagate at the speed of sound from the source to its surroundings.
  • Our ears can sense the changes in air pressure as sensation of sound.
  • When air is compressed/expanded, the pressure is higher/lower.

What is Waveform?

When an object vibrates and we plot the variation of air pressure at some nearby point against time, we obtain a waveform.

  • A common waveform is the sine function.
  • The time over which a waveform repeats itself is called its period; the unit is the second.
  • Frequency is defined as the reciprocal of the period; the unit is the Hertz (Hz).

What is Spectrum?

All periodic waveforms (e.g. square, sawtooth, …etc.) can be broken down into a combination of weighted sine waves each of which is called a frequency component of the waveform.

  • The collection of all frequency components is called the spectrum of the signal.
  • The spectrum of a signal can be obtained by performing a spectral analysis.
    • For an analog signal, spectral analysis is carried out with the Fourier transform (FT).
    • For a discrete signal, spectral analysis is carried out with the discrete Fourier transform (DFT).
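As a small illustration (sampling rate and frequencies chosen arbitrarily, not from the notes), NumPy's FFT can expose the frequency components of a square wave, whose spectrum is dominated by odd harmonics of the fundamental:

```python
import numpy as np

fs = 1000                                     # assumed sampling rate, Hz
t = np.arange(fs) / fs                        # one second of samples
# Square wave at 50 Hz: the sign of a sine
x = np.sign(np.sin(2 * np.pi * 50 * t))

spectrum = np.abs(np.fft.rfft(x)) / len(x)    # one-sided magnitude spectrum
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# The three largest components sit at the odd harmonics 50, 150, 250 Hz
peaks = freqs[np.argsort(spectrum)[-3:]]
print(sorted(peaks.tolist()))                 # [50.0, 150.0, 250.0]
```

Each of these peaks is one frequency component; the whole set of peaks is the spectrum of the square wave.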

Audio is analog in Nature

Audio itself is analog in nature.

Digital systems employ sampling and quantization to transform the audio information.

  • Sampling determines the bandwidth
    • Special precautions must be taken to prevent a condition of erroneous sampling known as aliasing.
  • Quantization determines the resolution
    • Quantization error occurs when the amplitude of an analog waveform is represented by a binary word, but its effects can be minimized by dithering the audio waveform prior to quantization.

Auditory system

Anatomy of the ear

Loudness

Not a linear scale

In the human auditory system,

  • we can hear sound levels from 0 dB up to 100 dB or even more.
  • we can hear a wide frequency range, from about 20 Hz to 16 kHz.

Why Sometimes we can’t hear some Sound?

Frequency Masking

Sounds below the “masking” threshold are masked (inaudible); the threshold depends on the level and the frequency of the masker.

Time Masking

Sounds are masked slightly before and after a louder sound occurs.

Perceptual attributes of sound

The 3 perceptual attributes of tones/sound

  • Loudness (related to amplitude)
  • Pitch (related to frequency)
  • Timbre (related to harmonics etc.)

Volume (Amplitude)

  • Ears respond to changing air pressure, which in turn deflects our eardrums, sending the perception of sound to our brains.
  • Volume level is referred to as the sound pressure level (SPL), and it indicates the loudness of a sound.
    • The standard unit of measurement for SPL is the dB.
    • 0 dB SPL is the softest sound the average human ear can hear (i.e. the threshold of hearing).

$$\text{dB difference} = 20 \log_{10}\left(\frac{\text{SPL of the measured}}{\text{SPL of the reference}}\right) = 20\log_{10}(\text{air pressure ratio})$$

  • dB is in log scale.
  • +dB values represent a multiplication of sound pressure.
  • -dB values represent a division of sound pressure.

Examples:

  • +6 dB ≈ x 2 the air pressure
  • +20 dB = x 10 the air pressure
  • +40 dB = x 100 the air pressure
  • +60 dB = x 1,000 the air pressure
  • +80 dB = x 10,000 the air pressure

If an amplifier adds 1% distortion to the incoming signal, the distortion is said to be 40 dB below the original signal (distortion/signal = 1/100, and $20\log_{10}(1/100) = -40$ dB).
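The ratio-to-dB relation above translates directly into a few lines of Python (the helper name is illustrative, not from the notes):

```python
import math

def db_difference(measured, reference):
    """dB difference between two sound pressure amplitudes."""
    return 20 * math.log10(measured / reference)

# +20 dB corresponds to a 10x pressure ratio
print(db_difference(10.0, 1.0))    # 20.0

# 1% distortion sits 40 dB below the signal
print(db_difference(0.01, 1.0))    # -40.0
```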

Example Question

If we add 2 identical sounds together, what will be the increase in dB?

The addition is constructive, so the amplitude ratio is 2 to 1.

$$\text{dB difference} = 20\log_{10}2 \approx +6\ \text{dB}$$

Remarks:

$$10\log_{10}\left(\frac{\text{Power}_2}{\text{Power}_1}\right) = 10\log_{10}\left(\frac{\text{Amplitude}_2}{\text{Amplitude}_1}\right)^2 = 20\log_{10}\left(\frac{\text{Sound pressure level}_2}{\text{Sound pressure level}_1}\right)$$

What is the relation between loudness and sound pressure level in decibels? Is 80 dB twice as loud as 40 dB? How do you translate from decibels to loudness?

Sound level in dB is a physical quantity and may be measured objectively.
Loudness is a perceived quantity and one can only obtain measurements of it by asking people questions about loudness or relative loudness. (Different people have different answers)

Relating the two is called psychophysics. Psychophysics experiments show that subjects report a doubling of loudness for each increase in sound level of approximately 10dB.
(So roughly speaking, 50dB is twice as loud as 40dB, 60 dB is twice as loud as 50dB, etc.)

Since 80dB is 40dB more than 40dB, 80dB is roughly 2x2x2x2 = 16 times as loud as 40 dB.
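This rule of thumb (roughly a doubling of loudness per +10 dB) can be sketched as a simple power-of-two model; the function name is mine:

```python
def loudness_ratio(db_increase):
    """Perceived loudness ratio, using the ~10 dB-per-doubling rule of thumb."""
    return 2 ** (db_increase / 10)

print(loudness_ratio(10))   # 2.0: +10 dB sounds roughly twice as loud
print(loudness_ratio(40))   # 16.0: 80 dB vs 40 dB, as in the example above
```

Remember this models perceived loudness, not sound pressure; the pressure ratio for +40 dB is 100, not 16.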

Frequency

  • A sound wave is introduced into a medium by a vibrating object and it can be considered as a combination of sine waves.
  • Frequency tells how many cycles the wave repeats in a second.
  • The unit of frequency measurement is the Hertz (Hz)
  • As with volume, frequencies are heard and often expressed logarithmically; going down one octave halves the frequency.
  • The logarithmic frequency/volume relation corresponds to how our ears naturally hear.

Pitch

Each musical note has a fundamental frequency which determines its pitch.

  • Raising the pitch of a note by one octave represents a doubling of frequency

  • The frequency ratio between any 2 adjacent musical half-steps is the 12th root of 2 ($r = \sqrt[12]{2} \approx 1.0595$).
  • There are 12 musical half-steps in one musical octave.
  • Freq. of a note $\times\ r$ = freq. of the next half-step up.
  • Freq. of a note $\times\ r^{12}$ = freq. of the note 1 octave higher.
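These relations can be sketched in a few lines (function and variable names are illustrative):

```python
r = 2 ** (1 / 12)              # frequency ratio of one half-step, ~1.0595

def half_steps_up(freq, n):
    """Frequency of the note n half-steps above (n < 0: below) freq."""
    return freq * r ** n

middle_c = 261.0               # Hz, the flute's middle C from the example
print(half_steps_up(middle_c, 12))           # one octave up (~522 Hz)
print(round(half_steps_up(middle_c, 2), 2))  # D above middle C (~292.96 Hz)
```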

Example Question

Given that the pitch of a middle C note played with a flute is 261 Hz.

What are the pitches of the other C notes played with the same flute?

The pitches of the C notes played with the same flute are 32.625, 65.25, 130.5, 261, 522, 1044, 2088, 4176 and 8352 Hz.

What do you expect the pitches of the D notes played with the same flute?

Note D is 2 half-steps above Note C.

$$\text{pitch of D} = \text{pitch of C} \times r \times r = C \times 2^{\frac{1}{12}} \times 2^{\frac{1}{12}} = C \times 2^{\frac{2}{12}}$$

Pitches of D notes played with the same flute are:

36.62, 73.24, 146.48, 292.96, 585.93, 1171.9, 2343.7, 4687.4, 9374.8 Hz.

Timbre

Timbre is also known as tone color or tone quality in psychoacoustics.

  • Timbre is the quality of a musical note/sound/tone that distinguishes different types of sound production, even when they have the same fundamental frequency.
  • It is affected by the harmonics, the signal’s envelope, etc.

Sampling

In practice, the continuous signal is sampled using an analog-to-digital converter (ADC).

The sampling theorem states that a continuous bandlimited signal can be replaced by a discrete sequence of samples without loss of any information, and it describes how the original continuous signal can be reconstructed from the samples.

The theorem specifies that the sampling frequency must be at least twice the highest signal frequency.

  • Half the sampling frequency is referred to as the Nyquist frequency (the minimum sampling rate itself, twice the highest signal frequency, is the Nyquist rate).

Aliasing

Aliasing is a consequence of violating the sampling theorem.

  • If the audio frequency is greater than half the sampling frequency, aliasing will occur
  • Alias frequencies appear back in the audio band, folded over from the sampling frequency.
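A quick numerical sketch of this folding (rates chosen arbitrarily): a 700 Hz tone sampled at 1 kHz, with no anti-aliasing filter, shows up at 1000 − 700 = 300 Hz:

```python
import numpy as np

fs = 1000                        # sampling rate, Hz (assumed)
t = np.arange(fs) / fs
f_in = 700                       # above the Nyquist frequency fs/2 = 500 Hz

x = np.sin(2 * np.pi * f_in * t)             # sampled without bandlimiting
spectrum = np.abs(np.fft.rfft(x))
alias = np.fft.rfftfreq(len(x), 1 / fs)[np.argmax(spectrum)]
print(alias)                     # 300.0: the tone folds back to fs - 700 Hz
```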

Alias prevention

The solution is to bandlimit the input signal with a sharp lowpass filter (anti-aliasing filter) designed to provide significant attenuation at the Nyquist frequency, ensuring that the signal being sampled contains no components above the Nyquist frequency.

  • Use Lowpass filter

An ideal anti-aliasing filter would have a “brick-wall” characteristic with instantaneous and infinite attenuation in the stopband.

In practice, it is designed with a transition band in which attenuation is achieved over a steeply sloping characteristic.

Quantization

A fixed number of bits is used to represent each sample of the digital signal.

Quantization is the technique of measuring an analog event to form a numerical value.

Sampling represents the time of the measurement.

Quantization represents the value of the measurement.

Quantization error / Quantization distortion

Normally, we round each sample to the nearest quantization level (e.g. a sample falling between levels 6 and 7 is assigned whichever is closer). There will be some error, because we can only choose the nearest level.

With very coarse quantization (in the extreme, only levels 0 and 1 are available), the error becomes very large.

Math about Quantization

Consider a quantization system in which

  • $n$ is the number of bits,
  • $N$ is the number of quantization steps ($N=2^n$), i.e. the total number of levels,
  • $Q$ is the quantizing interval (step size), i.e. the quantization step.

Quantization of Uniform probability density function:

Let's assume the signal is uniformly distributed over

$$\text{lower range} < x(n) < \text{upper range}$$

Let the word length be $n$ bits.

Total levels $= 2^n$

Range = Upper range - Lower range

$$\text{Quantization step} = \frac{\text{range}}{\text{total levels}}$$

Finding the mean.

Probability density = $p(x)$

Since it is a uniform probability density function, every value is equally likely, i.e. $p(x) = \frac{1}{\text{range}}$.

$$\text{Mean value} = \text{mean}(x) = \int_{\text{lower}}^{\text{upper}} x\, p(x)\, dx = \frac{1}{\text{range}}\int_{\text{lower}}^{\text{upper}} x\, dx$$

According to the reverse power rule $\int x^{n}\, dx=\frac{x^{n+1}}{n+1}$, further solving the mean value:

$$\frac{1}{\text{range}}\int_{\text{lower}}^{\text{upper}} x\, dx = \frac{1}{\text{range}}\left[\frac{x^{2}}{2}\right]^{\text{upper}}_{\text{lower}} = \frac{1}{\text{range}}\left(\frac{\text{upper}^2}{2} - \frac{\text{lower}^2}{2}\right)$$

The variance is defined on the signal itself (for a zero-mean signal, the variance equals the mean square):

$$\text{Variance} = x^2_{rms} = \int_{\text{lower}}^{\text{upper}} x^2\, p(x)\, dx = \frac{1}{\text{range}}\int_{\text{lower}}^{\text{upper}} x^2\, dx$$

According to the reverse power rule $\int x^{n}\, dx=\frac{x^{n+1}}{n+1}$, further solving the variance:

$$\frac{1}{\text{range}}\int_{\text{lower}}^{\text{upper}} x^2\, dx = \frac{1}{\text{range}}\left[\frac{x^{3}}{3}\right]^{\text{upper}}_{\text{lower}} = \frac{1}{\text{range}}\left(\frac{\text{upper}^3}{3} - \frac{\text{lower}^3}{3}\right)$$

Then the Signal RMS

$$x_{rms} = \sqrt{\text{Variance}} = \text{standard deviation}$$

Then find the peak factor, using $x_{peak} = \frac{\text{range}}{2}$:

$$\text{Peak factor} = P_F=\frac{x_{peak}}{x_{rms}}$$

Then find the quantization noise's $N_{rms}$.

We assume the quantization error is also uniformly distributed.

The same integral method applies, but the range runs from $-Q/2$ to $+Q/2$ (one quantization step wide), and we take the root of the mean square:

$$N_{rms} = \sqrt{\int_{-Q/2}^{Q/2} x^2\, p(x)\, dx} = \frac{Q}{\sqrt{12}}$$

Equivalently,

$$N_{rms} = \frac{Q/2}{\text{peak factor}}$$

Finally the SNR (Signal-to-Noise ratio):

$$\text{SNR} = \frac{x_{rms}}{N_{rms}}$$

And turn it in dB.

$$\text{SNR in dB} = 20\log_{10}\left(\frac{x_{rms}}{N_{rms}}\right)$$
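The chain of steps above can be checked numerically. A sketch assuming an 8-bit quantizer and a zero-mean uniform signal over −1 to +1 (both choices mine):

```python
import math

n = 8                            # word length in bits (assumed)
levels = 2 ** n                  # total number of quantization levels
lower, upper = -1.0, 1.0         # assumed full-scale range of the uniform signal
full_range = upper - lower
Q = full_range / levels          # quantization step

x_rms = full_range / math.sqrt(12)       # rms of a zero-mean uniform signal
peak_factor = (full_range / 2) / x_rms   # = sqrt(3) for a uniform signal
n_rms = (Q / 2) / peak_factor            # rms quantization noise = Q / sqrt(12)

snr_db = 20 * math.log10(x_rms / n_rms)
print(round(snr_db, 2))          # 48.16 dB, i.e. about 6.02 dB per bit
```

Note that for a uniform signal the signal-to-noise ratio reduces to range/Q = $2^n$, hence the 6.02 dB-per-bit rule.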

Quantization of Sine Signal:

For a sine signal, the signal RMS (the root-mean-square of the waveform over one period) is:

$$S_{rms}=\sqrt{\frac{1}{T} \int_{0}^{T}\left(s_{peak} \sin (\omega t)\right)^{2}\, dt}$$

where $s_{peak}$, the peak value of the maximum (full-scale) signal, equals:

$$s_{peak} = \pm \frac{QN}{2}$$

Evaluating the integral, you eventually get:

$$S_{rms} = \frac{QN}{2\sqrt{2}} = 0.707 \times \frac{QN}{2} \approx 0.354\, QN$$

Then For Error in the Sine Signal:

$$\text{r.m.s. quantization error} = E_{rms} = \left[\int_{-\infty}^{\infty} e^{2}\, p(e)\, de\right]^{1/2}=\left[\frac{1}{Q} \int_{-Q/2}^{Q/2} e^{2}\, de\right]^{1/2}=\left[\frac{Q^{2}}{12}\right]^{1/2}=\frac{Q}{2 \sqrt{3}}$$

where

$$p(e) = \frac{1}{Q} \quad \text{for } -Q/2 < e < Q/2$$

Finally to find the SNR:

$$\left(\frac{S_{rms}}{E_{rms}}\right)^2 = \frac{3N^2}{2}$$

SNR in dB:

$$SNR_{(dB)} = \frac{S}{E}\ (\text{in dB}) = 10\log_{10}\left(\frac{3N^2}{2}\right) = 6.02n + 1.76$$

Increasing the number of quantization bits increases the S/E ratio (higher dB).
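A small sketch of the formula (the function name is mine):

```python
import math

def sine_snr_db(n):
    """Theoretical SNR of an n-bit quantized full-scale sine: 10 log10(3N^2/2)."""
    N = 2 ** n
    return 10 * math.log10(3 * N ** 2 / 2)

print(round(sine_snr_db(16), 2))   # 98.09 dB for 16-bit audio (6.02*16 + 1.76)
print(round(sine_snr_db(8), 2))    # 49.93 dB for 8 bits
```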

Peak Factor (PF): the ratio of maximum value to the r.m.s value of an alternating quantity.

$$P_F=\frac{S_{peak}}{S_{rms}}$$

The quantization error is independent of the amplitude of the input signal but depends on the size of the quantization interval

$$N_{rms} = \frac{Q/2}{\text{peak factor}}$$

  • The quantized signal might contain components above the Nyquist frequency;
    • thus, aliasing might occur.
    • The aliasing caused by quantization can create an effect called granulation noise, so called because of its gritty, sandy sound quality.

Other Quantization methods to improve sound quality

These algorithmic choices influence the efficiency of the quantization bits, as well as the relative audibility of the error.

  • A quantizer can use a nonlinear distribution of quantization intervals along the amplitude scale to maintain a constant SNR for signals of different amplitudes.
  • Oversampling and noise shaping can be used to shift quantization error out of the audio band.
  • Dither is also a simple solution to address these quantization problems.
    • achieved by adding some noise

Dither

Dither can eliminate distortion caused by quantization, by reducing those artifacts to white noise.

  • With large-amplitude complex signals, there is little correlation between the signal and the quantization error; thus the error is random and perceptually similar to analog white noise. (That is, we cannot hear the noise.)
  • With low-level signals, the characteristics of the error change as it becomes correlated with the signal, and potentially audible distortion results. (That is, we can hear the noise; that's the problem.)
    • Then we might need Dither to eliminate distortion.

Without Dither:

With low-level signals, the characteristics of the error change as it becomes correlated with the signal, and potentially audible distortion results. (Figure D)

With Dither:

A small amount of noise is added to the audio signal prior to sampling to linearize the quantization process.

  • Rather than quantizing only the input signal, the dither noise and signal are quantized together, and this randomizes the error. (Figure H)
  • Dither changes the digital nature of the quantization error into a white noise, and the ear can then resolve signals with levels well below one quantization level.
  • Dither increases the noise floor of the output signal.
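A small simulation (step size, tone level, and seed chosen arbitrarily) shows the effect: a tone at 0.4 of one quantization step vanishes entirely without dither, but survives, buried in noise, when uniform dither is added before rounding:

```python
import numpy as np

gen = np.random.default_rng(0)
Q = 1.0                                        # quantization step (assumed)
t = np.arange(20000)
x = 0.4 * Q * np.sin(2 * np.pi * t / 100)      # tone below one quantization step

plain = np.round(x / Q) * Q                    # undithered quantizer
dith = np.round((x + gen.uniform(-Q / 2, Q / 2, x.size)) / Q) * Q

print(bool(np.all(plain == 0)))                # True: the tone vanishes entirely
corr = float(np.mean(dith * np.sin(2 * np.pi * t / 100)))
print(corr > 0.15)                             # True: dithered output still tracks the tone
```

The correlation estimate `corr` lands near 0.2, the value expected if the quantized output tracks the input tone on average; without dither it would be exactly zero.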

Types of Dither

  • Generally differentiated by their probability density function.

  • Analog-to-Digital converter : Sampling and Quantization
  • Digital pseudo-random noise generator + Digital-to-Analog converter : Dither noise

Mathematically, with dither, quantization error is no longer a deterministic function of the input signal, but rather becomes a zero-mean random variable.

Equalization

Equalization is an effect that allows a user to control the frequency response of the output signal.

  • A user can emphasize (boost) or deemphasize (suppress) selected frequency bands in order to change the output sound.
    • The amount that a frequency band is boosted or suppressed is generally indicated in dB.

Equalization can be conveniently done using a filterbank.

  • The input signal is typically passed through a bank of 5-7 bandpass filters (BPF).
  • The outputs of the filters are weighted by the corresponding gain factors and added to reconstruct the signal.
    • Each filter has its own cut-off frequencies and gain (weight).
    • (e.g. 20~1200 Hz for the 1st filter, 1200~2500 Hz for the 2nd filter, 2500~5000 Hz for the 3rd filter, 5000~10000 Hz for the 4th filter, 10000~20000 Hz for the 5th filter)
  • The filters are characterized by their normalized cut-off frequencies.

$$\text{Normalized cut-off frequency} = 2 \times \frac{\text{cut-off freq.}}{\text{sampling freq.}}$$
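A crude FFT-domain version of such a filterbank can be sketched in NumPy (band edges taken from the example above; the sampling rate and gains are my assumptions). A real equalizer would use time-domain bandpass filters, but the weighting-and-recombining idea is the same:

```python
import numpy as np

fs = 44100                                    # sampling rate (assumed)
bands = [(20, 1200), (1200, 2500), (2500, 5000),
         (5000, 10000), (10000, 20000)]       # band edges from the example above
gains_db = [3, 0, -6, 0, 3]                   # per-band boost/cut in dB (assumed)

def equalize(x, fs, bands, gains_db):
    """Crude FFT-domain filterbank: weight each band's spectrum and recombine."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    out = np.zeros_like(X)
    for (lo, hi), g in zip(bands, gains_db):
        mask = (freqs >= lo) & (freqs < hi)
        out[mask] = X[mask] * 10 ** (g / 20)  # dB gain -> amplitude factor
    return np.fft.irfft(out, n=len(x))

t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 3000 * t)
y = equalize(x, fs, bands, gains_db)

ratio = abs(np.fft.rfft(y)[1000]) / abs(np.fft.rfft(x)[1000])
print(round(ratio, 3))                        # ~1.413: the 1000 Hz tone got +3 dB
```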

The filters look like this:

Low-bit conversion and Noise shaping

Sidenote:

Laplace Transform is used for Continuous Signal

Z Transform is used for Discrete Signal

Noise shaping

Noise shaping moves noise away from the audio band (0 to 20 kHz) into a higher frequency band, so that we can no longer hear the noise.

The audio band means the band that humans can hear.

  • Noise shaping can be achieved with sigma-delta modulation.

SDM (Sigma-Delta Modulation)

Analyzing the SDM encoder:

Make use of Laplace transform

$$Y = (X-Y)\times\frac{1}{S} + N$$

Using $S = e^{j\omega} = \cos(\omega) + j\sin(\omega)$

The purple line denotes the audio band.

Mathematical basis of 1st-order noise shaping

Note: $\cos(2a)=1-2\sin^{2}(a)$

Basically:

$$Y = X + N(1-z^{-1}) = X + N \cdot H_1(z)$$

with magnitude response

$$|H_1(f)| = 2\left|\sin\left(\frac{\pi f}{f_{a}}\right)\right|$$

NS means Noise Shaping.

  • The higher the oversampling rate, the more the quantization noise can be removed from the audio band.

To conclude:

  • The $(1-z^{-1})$ factor doubles the quantized noise power and at the same time shifts the noise to high frequencies.
  • The higher the oversampling rate, the more the quantization noise can be removed from the audio band.

Example: operation principle of a 1st-order noise shaper

  • A fixed repeating output pattern is generated for a constant input.
  • The local average of the output (the repeated pattern) equals the input.
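An error-feedback sketch of this behaviour (quantizer levels and input value chosen for illustration): the quantization error of each sample is subtracted from the next input, which realizes the $Y = X + N(1-z^{-1})$ relation above:

```python
def first_order_shaper(x, levels=(0.0, 1.0)):
    """1st-order noise shaper in error-feedback form: each sample's
    quantization error is subtracted from the next input."""
    out, err = [], 0.0
    for s in x:
        v = s - err                                  # subtract previous error
        q = min(levels, key=lambda L: abs(L - v))    # 1-bit quantizer
        err = q - v                                  # new quantization error
        out.append(q)
    return out

# A constant input of 0.25 produces the repeating pattern 0, 0, 1, 0, ...
y = first_order_shaper([0.25] * 16)
print(y[:4])                 # [0.0, 0.0, 1.0, 0.0]
print(sum(y) / len(y))       # 0.25: the local average equals the input
```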

Higher-order noise shaping (n-order noise shaping)

The frequency response in N-order noise shaping:

$$H_{n}(z)=(1-z^{-1})^{n}, \qquad \left|H_{n}(f)\right|=\left(2\left|\sin\left(\pi f / f_{a}\right)\right|\right)^{n}$$

Nth-order noise shaping can employ cascaded sections.

Look at this 3rd-order noise shaper.

Which order should I use?

  • A successful noise shaping circuit thus seeks to balance a high oversampling rate with noise shaping order to reduce in-band noise and shift it away from audible range.
    • (Shift more and pay less cost)

Noise shaping with Dithering

Potential Problems of Low-order noise shaping :

  • The low-level linearity of low-order noise shaping circuits can be degraded by 2 problems known as idle patterns and thresholding.
  • A zero or very low-level input may result in a regular 1010 pattern. If such a pattern repeats with a long enough period, it may be audible as a deterministic or oscillatory tone, rather than as noise.

In order to solve the Potential Problems of Low-order noise shaping :

  • To remove signal distortion, a noise shaping circuit must employ dithering.
  • The most basic form of dither is flat, white noise (heard, if at all, only as a faint hiss).
  • Dither can be added to the input data so the circuit always operates with a changing signal even when the audio signal is zero or DC.
    • prevent 1010 pattern

No Dither and with Dither : Comparison

White noise has a flat spectrum; colored noise does not.

Conclusion - Use of noise shaping

  • Noise shaping is advantageous because a simple shaper can remove quantization noise from the audio band.
  • These algorithms are more effective at high sampling rates so there is more spectral space between the highest audio frequency and the Nyquist frequency.
  • Another possible objective of noise shaping is reduction in the number of bits required to represent the signal. (Quantization)
    • Noise shaping is good for quantization and hence can be used for such a purpose.
    • The so-called 1-bit stream technology used in audio processing equipment is based on this technique.
  • With any 1-bit system, because of the noise shaping employed, it is difficult to quote a meaningful figure for signal-to-noise ratio because the noise level varies with respect to frequency.
  • However, in general, a 1-bit system can provide an audio-band noise floor lower than that encountered in 16- or 18-bit conversion.

Application of Noise Shaping in Analog-to-digital Conversion

The Problem of Old conventional system

Annotation 1: Brick-wall characteristic is required to avoid aliasing and phase distortion. (LPF is not ideal)

Annotation 2: Analog circuit is sensitive to age, temperature and humidity.

Annotation 3: The quantization noise introduced distributes over the audio band.

Annotation 4: It’s technically difficult to maintain the linearity of a conventional multibit A/D converter; the quantization noise is generally input dependent and not white (audible).

Annotation 5: The audio signal suffers from an amount of quantization noise and aliasing error.

Improve conventional system with Noise Shaping

We basically solved the noise problem and aliasing problem.

Note: It is Digital LPF, not LDF.

  • We use oversampling ($R \times f_s$).
    • The higher the oversampling rate, the better the output performance.
    • Oversampling introduces a gap between successive spectral images, which resolves the potential aliasing problem caused by using a gentle analog LPF and reduces the burden on the digital LPF used to avoid aliasing later on.
  • A digital LPF is more reliable than an analog LPF as an anti-aliasing filter, as it isn’t sensitive to humidity, age, and temperature.
    • The digital LPF acts as the anti-aliasing filter.
  • Quantization noise is still introduced, but most of the audio-band noise is shifted away by noise shaping.
    • Only a little quantization noise remains in the audio band.