Speech Recognition

What is Automatic Speech Recognition (ASR)?

  • Definition: Converting audio signal to text/transcription
  • Example applications of ASR: Amazon Alexa, robotics, audio indexing, voice dialing in smartphones, and Nuance’s Dragon TV

Types of Speech Recognition Systems

Small vocabulary continuous speech recognition

  • Vocabulary of size under 100
  • Speakers can speak the words continuously without pause
  • Typically used for recognizing continuous spoken digits
  • E.g., Hong Kong Towngas’ meter reading reporting system

Large vocabulary continuous speech recognition (LVCSR)

  • Full vocabulary set of the language
  • Speakers can speak the words continuously without pause
  • Typically used for recognizing continuous phrases, e.g., dictation and transcription of broadcast news.
  • E.g., Google voice search, Dragon Dictation App for iOS, Dragon Naturally Speaking, Mac OS X Dictation, WeChat Voice Open Platform, InfoTalk-Recognizer, iFLYTEK Speech Recognition Engine, etc.

Isolated-word speech recognition

  • Small vocabulary
  • Speakers should provide pause between words
  • Typically used for voice command
  • Easy to implement
  • E.g., Truly Handsfree Voice Control

Keyword Spotting

  • Locating keywords from unconstrained speech
  • Used for crime analysis, quality assurance (QA) for call centers, voice command detection, wake-up word detection, etc.

Spoken Dialog Systems

  • Large vocabulary
  • Not only converts speech to text; the system also needs to understand the meaning and intention of the user and give human-like responses.
  • Can detect the emotional state of the speaker.
  • Some errors are fine as long as they do not affect the meaning of the spoken sentences.
  • Need natural speech synthesis
  • A field of natural language understanding
  • Applied to information access, financial transactions, and chatbots
  • E.g., Apple Siri, Microsoft Cortana, Google Assistant, Alexa

Evaluation of Speech Recognition Systems

  • Three types of recognition error:
    • Substitution (the wrong word is recognized)
    • Deletion (a word is omitted)
    • Insertion (an extra word is recognized)

The word error rate (WER) is

$$\mathrm{WER}=\frac{C(\text{substitution})+C(\text{deletion})+C(\text{insertion})}{N} \times 100\%$$

where $N$ is the number of words in the reference transcription of the test speech.
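
A minimal Python sketch of computing the WER by edit-distance alignment (the reference and hypothesis strings are hypothetical):

```python
# Minimal WER computation via Levenshtein alignment: the optimal edit distance
# equals C(substitution) + C(deletion) + C(insertion).
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution or match
            dele = d[i - 1][j] + 1                               # deletion
            ins = d[i][j - 1] + 1                                # insertion
            d[i][j] = min(sub, dele, ins)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)   # N = number of reference words

print(word_error_rate("this is speech", "this is a beach"))  # 66.7% (1 insertion + 1 substitution, N = 3)
```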

Speech Features

Speech waveforms are non-stationary signals, because our vocal tract moves continuously while we speak.

  • A simple solution to deal with non-stationary signals is to process speech waveform frame-by-frame so that signal within each frame can be considered stationary.
  • Each frame has a length of about 15–40 ms because our vocal tract cannot move very fast.
    • A sliding window is used.

The most popular acoustic feature is mel-frequency cepstral coefficients (MFCCs).

  • It uses a filter bank mimicking the human cochlea.
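
A minimal sketch of frame-based MFCC extraction, assuming the librosa library and a hypothetical 16-kHz file `utterance.wav`:

```python
import librosa

# Load a (hypothetical) 16-kHz utterance.
y, sr = librosa.load("utterance.wav", sr=16000)

# 25-ms frames with a 10-ms hop (sliding window), a common choice within the 15-40 ms range.
frame_len = int(0.025 * sr)   # 400 samples
hop_len = int(0.010 * sr)     # 160 samples

# 13 MFCCs per frame; the underlying mel filter bank mimics the cochlea's frequency resolution.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=hop_len)
print(mfcc.shape)  # (13, number_of_frames)
```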

HMM-based Speech Recognition

A hidden Markov model (HMM) can be considered an extension of the Gaussian mixture model (GMM).

  • Multiple GMMs with state transitions
  • Both GMMs and HMMs can be used as generative models

Operation of HMM-Based ASR

Speech =[Feature Extraction]=> MFCC =[Decoder]=> Words / Sentence

The Decoder contains the Acoustic Models (GMM-HMM), a Pronunciation Dictionary, and a Language Model.

img

  • The unknown speech waveform is converted by a front-end signal processor into a sequence of acoustic vectors $X$, typically 39-dim MFCCs
  • Given $X$, the LVCSR system aims to determine the most probable word sequence
  • The language model postulates a word sequence (“This is speech” in this example) and determines its probability $P(W)$.
  • A composite model for the word sequence is generated by concatenating a number of phone-based HMMs, where the phones are determined by a pronunciation dictionary.
  • The likelihood of the composite model generating the observed acoustic sequence is calculated, i.e., P(X|W).
  • This likelihood is then multiplied by the word-sequence probability P(W) obtained from the language model.
  • The process is repeated for all possible word sequences allowed by the language model, with the most likely sequence being selected as the recognizer’s output.
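
In equation form, the decoding process described above selects the word sequence whose combined acoustic likelihood and language-model probability is the largest:

$$\hat{W}=\underset{W}{\arg \max }\; P(W \mid X)=\underset{W}{\arg \max }\; p(X \mid W)\, P(W)$$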

img

Language Modeling

The purpose of language modeling is to estimate the probability of a word $W_k$ in an utterance given the preceding words:

$$W_{1}^{k-1}=W_{1}, \ldots, W_{k-1}$$

  • n-gram: $W_k$ depends only on the preceding $n-1$ words

$$P\left(W_{k} \mid W_{1}^{k-1}\right)=P\left(W_{k} \mid W_{k-n+1}^{k-1}\right)$$

  • It is effective for English because word order in English is important and the strongest contextual effects tend to come from neighboring words.

  • N-grams can be estimated from simple frequency counts and stored in a look-up table.
    e.g., tri-gram

$$P\left(W_{k} \mid W_{k-1}, W_{k-2}\right) \approx \frac{C\left(W_{k-2}, W_{k-1}, W_{k}\right)}{C\left(W_{k-2}, W_{k-1}\right)}$$

  • where $C\left(W_{k-2}, W_{k-1}, W_{k}\right)$ is the number of times the tri-gram $\left[W_{k-2}, W_{k-1}, W_{k}\right]$ appears in the training data and $C\left(W_{k-2}, W_{k-1}\right)$ is the number of times the bi-gram $\left[W_{k-2}, W_{k-1}\right]$ appears.
    Using tri-grams, the prior probability of a word sequence $W=\left\{W_{1}, \ldots, W_{K}\right\}$ is

$$P(W)=\prod_{k=3}^{K} P\left(W_{k} \mid W_{k-1}, W_{k-2}\right)$$
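
A minimal sketch (plain Python with a hypothetical toy corpus) of estimating tri-gram probabilities from frequency counts and storing them in a look-up table:

```python
from collections import Counter

# Toy training corpus (hypothetical); real language models use millions of sentences.
corpus = [
    "this is speech".split(),
    "this is a test".split(),
    "this is speech recognition".split(),
]

tri_counts, bi_counts = Counter(), Counter()
for sentence in corpus:
    for k in range(2, len(sentence)):
        tri_counts[tuple(sentence[k - 2:k + 1])] += 1   # C(W_{k-2}, W_{k-1}, W_k)
        bi_counts[tuple(sentence[k - 2:k])] += 1        # C(W_{k-2}, W_{k-1})

def trigram_prob(w2, w1, w):
    """P(W_k | W_{k-1}, W_{k-2}) ~= C(w2, w1, w) / C(w2, w1)."""
    if bi_counts[(w2, w1)] == 0:
        return 0.0          # real systems use smoothing/back-off here
    return tri_counts[(w2, w1, w)] / bi_counts[(w2, w1)]

print(trigram_prob("this", "is", "speech"))  # 2/3 for this toy corpus
```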

HMM for Isolated Word Recognition

  • V words are to be recognized and each word is modeled by a distinct HMM.

  • We have a training set of K utterances of each word, where each utterance constitutes an observation sequence.

  • To train and test an isolated-word speech recognizer, we

    1. optimize the model parameters of each word-based HMM by maximizing the likelihood of the observation vectors derived from the corresponding word.
    2. perform the processing shown in the next slide for each query word.
  • The number of states Q corresponds roughly to

    1. the no. of sounds (phonemes) within the word, or
    2. the average number of observations (MFCC vectors) in a spoken version of the word.
  • For continuous models, we use as many as M = 64–256 mixtures per state (if we have enough data).
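
A minimal sketch of training one HMM per word and picking the most likely word, assuming the hmmlearn package and hypothetical per-word lists of (frames × 39) MFCC arrays:

```python
import numpy as np
from hmmlearn import hmm

def train_word_hmm(mfcc_list, n_states=5, n_mix=4):
    """Fit one GMM-HMM to all training utterances (MFCC arrays) of a word."""
    X = np.vstack(mfcc_list)               # concatenate the frames of all utterances
    lengths = [len(m) for m in mfcc_list]  # number of frames in each utterance
    model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def recognize(models, query_mfcc):
    """Return the word whose HMM gives the highest log-likelihood for the query."""
    return max(models, key=lambda w: models[w].score(query_mfcc))

# Hypothetical usage:
# models = {word: train_word_hmm(utts) for word, utts in train_mfcc.items()}
# print(recognize(models, query_mfcc))
```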

DNN-based Speech Recognition

We can replace the GMMs in the HMM acoustic models by a DNN, giving a DNN-HMM system.

Why DNN-HMM Works Better?

  • In GMM-HMM, we feed one acoustic vector to the HMM at a time. Correlation between frames can only be captured in the ΔMFCC and ΔΔMFCC
  • In DNN-HMM, we feed at least 11 consecutive acoustic vectors to the DNN at a time.
  • DNNs are very good at leveraging high correlation in the features. Therefore, we may use filter-bank features X(m) instead of MFCCs.
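
A minimal PyTorch sketch of a DNN acoustic model that takes 11 spliced frames of 40-dim filter-bank features and outputs the posterior probabilities of the HMM states (the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

N_FBANK, CONTEXT, N_STATES = 40, 11, 141   # 11 spliced frames; 141 = 47 phones x 3 states

class DNNAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FBANK * CONTEXT, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, N_STATES),          # logits; softmax turns them into posteriors
        )

    def forward(self, x):                       # x: (batch, 11 * 40) spliced filter-bank frames
        return self.net(x)

model = DNNAcousticModel()
frames = torch.randn(8, N_FBANK * CONTEXT)          # a batch of 8 spliced inputs
posteriors = torch.softmax(model(frames), dim=-1)   # P(s_k | X); each row sums to 1
```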

Speech Recognition: Tutorial

Q1

Q1: You are requested to develop a hidden Markov model (HMM) based speech recognizer that can recognize the words “Yes” and “No”.

Q1(a)

Q1a) Assuming that the occurrences of “Yes” and “No” have equal prior probability, draw a block diagram to illustrate the structure of the recognizer. Your diagram should contain blocks depicting the feature extractor, HMMs, and decision logic. (5 marks)

First, convert the speech into MFCCs. Then feed the MFCCs into the HMM of each class, compute the log-likelihood from each HMM, and pick the maximum to determine the output.

img

Q1(b)

Q1b) Outline the procedure for training the HMMs in this recognizer. (6 marks)

Firstly, collect many utterances of “Yes” and “No” from many speakers.

Then, extract the acoustic vectors (energy and MFCCs plus their first and second derivatives) from the speech regions of these utterances.

Finally, use the acoustic vectors from “Yes” to train an HMM to model the spectral and temporal characteristics of “Yes”, and use the acoustic vectors from “No” to train another HMM to model the spectral and temporal characteristics of “No”.

Q1(c)

Q1c) Given that both “Yes” and “No” comprise three phonemes, suggest the number of states in the HMMs. Briefly explain your answer. (4 marks)

Two HMMs are needed (one for each class). The number of states of each HMM should not be less than 3, because each word comprises 3 phonemes and each phoneme needs at least one state; with the common convention of 3 states per phoneme, 9 states per HMM would be a typical choice.

Q1(d)

Q1d) Denote the likelihoods of an acoustic sequence $\mathcal{X}$ as $p\left(\mathcal{X} \mid \Lambda_{\text{yes}}\right)$ and $p\left(\mathcal{X} \mid \Lambda_{\text{no}}\right)$, where $\Lambda_{\text{yes}}$ and $\Lambda_{\text{no}}$ are the HMMs of the words “Yes” and “No”, respectively. Also denote the prior probabilities for “Yes” and “No” as $P(\text{‘Yes’})$ and $P(\text{‘No’})$, respectively. If $P(\text{‘Yes’}) = 0.2$, explain how you would use the likelihoods and the prior probabilities to classify $\mathcal{X}$. Your answer should contain an expression relating the predicted word, the prior probabilities, and the likelihoods. (6 marks)

$$\begin{aligned} l(\mathcal{X}) &=\underset{i \in\{\text{‘Yes’},\,\text{‘No’}\}}{\arg \max }\; P(i \mid \mathcal{X}) \\ &=\underset{i \in\{\text{‘Yes’},\,\text{‘No’}\}}{\arg \max }\; \frac{P(i)\, p\left(\mathcal{X} \mid \Lambda_{i}\right)}{0.2\, p\left(\mathcal{X} \mid \Lambda_{\text{yes}}\right)+0.8\, p\left(\mathcal{X} \mid \Lambda_{\text{no}}\right)} \\ &=\underset{i \in\{\text{‘Yes’},\,\text{‘No’}\}}{\arg \max }\; \frac{P(i)\, p\left(\mathcal{X} \mid \Lambda_{i}\right)}{p(\mathcal{X})} \\ &=\underset{i \in\{\text{‘Yes’},\,\text{‘No’}\}}{\arg \max }\; P(i)\, p\left(\mathcal{X} \mid \Lambda_{i}\right) \end{aligned}$$

where $P(\text{‘Yes’}) = 0.2$ and $P(\text{‘No’}) = 0.8$ (the priors must sum to 1).
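
A minimal Python sketch of this decision rule in the log domain (the two log-likelihood values are hypothetical):

```python
import math

priors = {"Yes": 0.2, "No": 0.8}

def classify(loglik_yes, loglik_no):
    # argmax over P(i) * p(X | Lambda_i), computed as log P(i) + log p(X | Lambda_i)
    scores = {
        "Yes": math.log(priors["Yes"]) + loglik_yes,
        "No": math.log(priors["No"]) + loglik_no,
    }
    return max(scores, key=scores.get)

# Hypothetical log-likelihoods: the 2-nat likelihood gap outweighs the prior, so "Yes" wins.
print(classify(loglik_yes=-1050.0, loglik_no=-1052.0))
```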

Q1(e)

Q1e) Are pronunciation dictionaries and language models necessary for this recognizer? Briefly explain your answer. (4 marks)

This recognizer is for a two-word recognition task, so it is not necessary to use phone models; word models are used instead. Therefore, pronunciation dictionaries and language models are not necessary.

Q2

The following figure shows the architecture of a large-vocabulary speech recognition system.

img

Q2i

Q2i) Suggest a typical parametric representation of the speech waveform.

MFCC(1–12) + ΔMFCC(1–12) + ΔΔMFCC(1–12) + energy + Δenergy + ΔΔenergy, giving 39 features per acoustic vector (39-dim MFCC).
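
A minimal sketch, assuming librosa and a hypothetical `utterance.wav`, of assembling 39-dim acoustic vectors by stacking the static coefficients with their first and second derivatives:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)     # hypothetical input file

# 13 static coefficients: c0 (often replaced by log-energy) plus MFCC(1-12)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc, order=1)        # first derivatives (delta)
delta2 = librosa.feature.delta(mfcc, order=2)       # second derivatives (delta-delta)

features = np.vstack([mfcc, delta, delta2])         # shape (39, number_of_frames)
print(features.shape)
```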

Q2ii

Q2ii) What is the purpose of the pronunciation dictionary?

Pronunciation dictionary: convert the words in the hypothesized word sequence into a phone sequence.

Q2iii

Q2iii) Discuss the purpose of the language model.

Language model: postulates a word sequence (“this is speech” in the diagram) and determines its probability $P(W)$.

In general, the language model estimates the probability of a word $W_k$ in an utterance given the preceding words $W_1^{k-1} = W_1, \ldots, W_{k-1}$.

Q2iv

Q2iv) Discuss the purpose of the acoustic models. Explain how the models can be implemented.

Acoustic models: A composite model for the word sequence is generated by concatenating a number of phone-based HMMs, where the phones are determined by a pronunciation dictionary.

The likelihood of the observed data (MFCC sequence) given the composite model is calculated.

Q2v

Q2v) Explain why it is not necessary to remove the silence regions of the speech signal before performing front-end parameterization.

Because silence is also modeled by the acoustic models, silence regions will be automatically recognized and will be part of the phone string output by the recognizer.

Q3

Q3: Why are HMMs better than GMMs for speech recognition?

As mentioned, an HMM can be considered an extension of a Gaussian mixture model (GMM).

A GMM has one state only, while an HMM has multiple states.

As a result, GMMs ignore the temporal information embedded in the MFCC sequences. On the other hand, each state of a phone-based HMM contains a GMM that was trained to match the spectral characteristics (MFCCs) of part of a phone, e.g., the 1st state will match the front part of a phoneme. For a word-based HMM, each state was trained to match the acoustic characteristics of part of a word. With the state-dependent GMMs and the state-transition mechanism, an HMM can model the spectral-temporal characteristics of speech signals.

img

Q4

Q4: How can the outputs of a DNN be used with HMMs in a DNN-HMM speech recognition system?

For example, in a monophone DNN-HMM system, we have 47 phones (including silence and short pause), each has 3 states. As a result, the DNN has 47x3 = 141 outputs.

Note: For context-dependent HMMs, we have around 9000 tied states, which means the DNN has 9000 output nodes. [Note: tied states were not covered. You may think of tied states as HMM states. In real ASR systems, each HMM represents a tri-phone instead of a phone.]

Assume that we have K states (K DNN outputs) as shown below.

img

Using Bayes rule, we can express the posterior in terms of likelihood × prior:

$$\operatorname{DNN}_{k}(X)=P\left(s_{k} \mid X\right)=\frac{p\left(X \mid s_{k}\right) P\left(s_{k}\right)}{p(X)} \propto p\left(X \mid s_{k}\right) P\left(s_{k}\right)$$

where

  • $\operatorname{DNN}_{k}(X)$ is the output of the $k$-th output node subject to the input $X$ (11 frames of MFCC vectors)
  • $P\left(s_{k}\right)$ is the prior probability of the phone state $s_k$.
  • Then,

$$p\left(X \mid s_{k}\right) \propto \frac{\operatorname{DNN}_{k}(X)}{P\left(s_{k}\right)}.$$

Note: The prior $P\left(s_{k}\right)$ of the HMM states can be obtained from the frequency of occurrence of the corresponding states in the forced alignment during DNN fine-tuning.
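
A minimal numpy sketch of converting the DNN posteriors into the scaled log-likelihoods used as HMM emission scores (the posterior matrix and state priors are assumed given):

```python
import numpy as np

def posteriors_to_scaled_loglik(posteriors, state_priors, eps=1e-10):
    """
    posteriors:   (T, K) DNN outputs P(s_k | X_t) for T frames and K states
    state_priors: (K,)   P(s_k) estimated from forced-alignment state counts
    returns:      (T, K) log p(X_t | s_k) up to a constant, used by the HMM decoder
    """
    return np.log(posteriors + eps) - np.log(state_priors + eps)
```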

Q5

Q5: In DNN-HMM, how can we ensure that the output nodes of the DNN produce the posterior probabilities of phones?

We can use the softmax function to ensure that the DNN outputs are probabilities and sum to 1.0.

Here, we do not apply a sigmoid nonlinearity to the output activations, because we want negative values (logits) to be able to enter the softmax function, indicating that an event has low probability.
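
A minimal numpy sketch of the softmax applied to the output activations (logits), including a negative logit:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# The negative activation maps to a small probability; the outputs sum to 1.0.
print(softmax(np.array([2.0, 0.5, -3.0])))
```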

Extra

In isolated-word speech recognition, we may use one hidden Markov model (HMM) for each English word. Explain why the number of states in the HMMs depends on the word being modeled. How would you determine the number of states for each HMM?

Because the number of syllables varies from word to word and each state in an HMM can only model a sub-part of a syllable, the required number of states for modeling a complete word varies from word to word. The rule of thumb is that the longer the word (in number of letters), the larger the number of syllables. As a result, we need more states for modeling longer words. Each syllable requires at least 3 states to model. For example, for words comprising three syllables, we need 9 states plus some short-pause states.

Fig. Q4 shows the waveform and spectrogram of the phoneme [i:] in the words “speech” and “each”. Also shown are the waveform and spectrogram of the phoneme [I] in the word “it”.

img

(a) The value 0.6 in Fig. Q4 is the probability of remaining in the first state. Suggest the probability of transiting from State 1 to State 2 in the HMM. Show how you calculate this probability.

Transiting from State 1 to State 2 is the complement of remaining in State 1; therefore the probability is $1 - 0.6 = 0.4$.

(b) If the vertical dashed lines indicate the three sections of the phonemes, deduce roughly the probability of remaining in the first state for the HMM that models the phoneme [I] in the word “it”. Briefly explain your answer.

The probability will be larger than 0 but smaller than 0.6. This is because the first section of the phoneme [I] is shorter. As a result, the chance of staying in State 1 becomes smaller.

(c) Denote $p\left(\mathcal{X}_{p} \mid \Lambda_{q}\right)$ as the likelihood of $\mathcal{X}_{p}$ given the HMM model $\Lambda_{q}$ corresponding to Phoneme $q$, where $\mathcal{X}_{p}$ comprises the acoustic vectors (MFCCs) corresponding to Phoneme $p$. State if the following conditions are true or false.

  • Condition 1: $p\left(\mathcal{X}_{\mathrm{i:}} \mid \Lambda_{\mathrm{i:}}\right) > p\left(\mathcal{X}_{\mathrm{i:}} \mid \Lambda_{\mathrm{I}}\right)$
  • Condition 2: $p\left(\mathcal{X}_{\mathrm{i:}} \mid \Lambda_{\mathrm{i:}}\right) < p\left(\mathcal{X}_{\mathrm{I}} \mid \Lambda_{\mathrm{i:}}\right)$
  • Condition 3: $p\left(\mathcal{X}_{\mathrm{I}} \mid \Lambda_{\mathrm{I}}\right) = p\left(\mathcal{X}_{\mathrm{i:}} \mid \Lambda_{\mathrm{i:}}\right)$

Briefly explain your answers.

Condition 1 is True. The MFCCs of i: match the HMM of i: on the left-hand-side of the inequality. On the other hand, there is a mismatch between the MFCCs and HMM on the right-hand-side of the inequality.

Condition 2 is False. The MFCCs match the HMM on the left. Its likelihood should be larger than the one on the right in which the MFCCs do not match the HMM.

Condition 3 is likely to be false. Because speech is stochastic, it is very unlikely that the two likelihoods are identical.

Speaker Recognition

What is Speaker Recognition?

  • Two Types of Speaker Recognition: speaker identification and speaker verification
    • Speaker Identification is to identify one out of N speakers, where N can be very large (thousands)
    • Speaker verification is to verify whether the incoming voice belongs to the claimed speaker (binary classification problem)

Speaker verification can be divided into two types: text-independent and text-dependent

  • Text-independent SV:
    • No restriction on the text, i.e., the speaker can say anything
    • Typically applied to forensics and security
    • Sentences could be very long (on the order of minutes)
  • Text-dependent SV:
    • Speakers are required to speak prompted phrases
    • Typically applied to biometric authentication
    • Prompted phrases are short (on the order of seconds)

Speaker Recognition applications include:

  • Biometric Authentication
  • Criminal Investigation
  • UI Personalization
  • Fraud Detection
  • Personalizing smart speakers
  • Information retrieval (speaker diarisation: who spoke when?)
  • Forensic speaker recognition (used in court/criminal investigations for estimating the strength of evidence)

GMM-UBM Speaker Recognition

The acoustic vectors (MFCCs) of speaker $s$ are modeled by a probability density function parameterized by $\Lambda^{(s)}=\left\{\lambda_{j}^{(s)}, \mu_{j}^{(s)}, \Sigma_{j}^{(s)}\right\}_{j=1}^{M}$:

$$p\left(\mathbf{x} \mid \Lambda^{(s)}\right)=\sum_{j=1}^{M} \lambda_{j}^{(s)} p\left(\mathbf{x} \mid \mu_{j}^{(s)}, \Sigma_{j}^{(s)}\right)$$

Gaussian mixture model (GMM) for speaker $s$:

$$\Lambda^{(s)}=\left\{\lambda_{j}^{(s)}, \mu_{j}^{(s)}, \Sigma_{j}^{(s)}\right\}_{j=1}^{M}$$

The acoustic vectors of the general population are modeled by another GMM called the universal background model (UBM):

$$p\left(\mathbf{x} \mid \Lambda^{(\mathrm{ubm})}\right)=\sum_{j=1}^{M} \lambda_{j}^{(\mathrm{ubm})} p\left(\mathbf{x} \mid \mu_{j}^{(\mathrm{ubm})}, \Sigma_{j}^{(\mathrm{ubm})}\right)$$

Parameters of the UBM:

$$\Lambda^{(\mathrm{ubm})}=\left\{\lambda_{j}^{(\mathrm{ubm})}, \mu_{j}^{(\mathrm{ubm})}, \Sigma_{j}^{(\mathrm{ubm})}\right\}_{j=1}^{M}$$

The universal background model (UBM) is adapted to the speaker-dependent GMM through maximum a posteriori (MAP) adaptation.

Computing a speaker model using a small amount of speaker-dependent data:

img

img

$$\mu_{j}^{(s)}=\alpha_{j} E_{j}\left(X^{(s)}\right)+\left(1-\alpha_{j}\right) \mu_{j}^{(\mathrm{ubm})}$$

  • When the no. of vectors in $X^{(s)}$ is small, $\alpha_{j} \approx 0$, so $\mu_{j}^{(s)} \approx \mu_{j}^{(\mathrm{ubm})}$

  • When the no. of vectors in $X^{(s)}$ is large, $\alpha_{j} \approx 1$, so $\mu_{j}^{(s)} \approx E_{j}\left(X^{(s)}\right)$ = the mean of the $j$-th mixture estimated from the speaker's data
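
A minimal numpy sketch of MAP mean adaptation, assuming the UBM means, the per-frame mixture posteriors, and a relevance factor r (which controls how fast $\alpha_j$ approaches 1) are given:

```python
import numpy as np

def map_adapt_means(X, ubm_means, post, r=16.0):
    """
    X:         (T, D) acoustic vectors of the speaker
    ubm_means: (M, D) UBM mean vectors
    post:      (T, M) posterior probability of each UBM mixture for each frame
    r:         relevance factor controlling the adaptation speed
    """
    n_j = post.sum(axis=0)                                 # soft counts per mixture, shape (M,)
    E_j = (post.T @ X) / np.maximum(n_j[:, None], 1e-10)   # E_j(X^(s)): posterior-weighted means
    alpha = (n_j / (n_j + r))[:, None]                     # alpha_j -> 1 as the soft count grows
    return alpha * E_j + (1.0 - alpha) * ubm_means         # adapted speaker means mu_j^(s)
```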

GMM-UBM Scoring

  • Given the utterance of a claimant $c$, speaker verification can be formulated as a 2-class hypothesis problem:

    • $H_{0}$: MFCC sequence $X^{(c)}$ comes from a true speaker

    • $H_{1}$: MFCC sequence $X^{(c)}$ comes from an impostor

  • Verification score is a log-likelihood ratio:

$$\operatorname{Score}\left(X^{(c)} \mid \boldsymbol{\Lambda}^{(s)}, \boldsymbol{\Lambda}^{\mathrm{ubm}}\right)=\log \frac{p\left(X^{(c)} \mid H_{0}\right)}{p\left(X^{(c)} \mid H_{1}\right)}=\log p\left(X^{(c)} \mid \boldsymbol{\Lambda}^{(s)}\right)-\log p\left(X^{(c)} \mid \boldsymbol{\Lambda}^{\mathrm{ubm}}\right)$$
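
A minimal sketch of the scoring step with scikit-learn's GaussianMixture standing in for an already-trained speaker GMM and UBM:

```python
from sklearn.mixture import GaussianMixture

def gmm_ubm_score(X_claimant, speaker_gmm: GaussianMixture, ubm: GaussianMixture):
    """
    X_claimant: (T, D) MFCC sequence of the claimant.
    Returns the average per-frame log-likelihood ratio
    log p(X | speaker GMM) - log p(X | UBM).
    """
    return speaker_gmm.score(X_claimant) - ubm.score(X_claimant)

# Hypothetical usage: accept the claim if the score exceeds a decision threshold.
# accept = gmm_ubm_score(X, speaker_gmm, ubm) > threshold
```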

img

GMM-SVM Speaker Recognition

  • Performs better than GMM-UBM
  • Given the speech of a client speaker, use MAP adaptation to create his/her model.
  • Stack the mean vectors of the speaker model to form a supervector
  • Then, the supervector is used as input to an SVM classifier for verification

I-Vector Speaker Recognition

  • The i-vector is a speaker-embedding method that represents a whole utterance by a low-dimensional vector.

Factor analysis model:

$$\vec{\mu}_{i}=\vec{\mu}+\mathbf{T} \mathbf{w}_{i}$$

where $\vec{\mu}$ is the UBM supervector and $\mathbf{T}$ is the low-rank total variability matrix.

The posterior mean of $\mathbf{w}_i$ is the speaker-dependent i-vector.

  • Instead of using the high-dimensional $\vec{\mu}_{i}$ to represent the $i$-th speaker, we use the low-dimensional (typically 500) i-vector $\mathbf{x}_{i}=\left\langle\mathbf{w}_{i} \mid O_{i}\right\rangle$ to represent the speaker.

  • $\mathbf{T}$ is estimated by an EM algorithm using the utterances of many speakers. $\mathbf{T}$ represents the subspace in which the i-vectors vary.

  • Given $\mathbf{T}$, we estimate the i-vector $\mathbf{x}_{s}$ of each target speaker $s$ and the i-vector $\mathbf{x}_{t}$ of each test utterance.

  • Given an utterance from speaker $s$ and the total variability matrix $\mathbf{T}$, we estimate his/her i-vector $\mathbf{x}_{s}$.

  • Because $\mathbf{T}$ defines a combined space describing both speaker variability and channel variability, we use LDA to remove the channel variability.

img

img

I-Vectors Scoring

  • Scoring means computing the similarity between the two i-vectors.

Given the i-vector of the target speaker and the i-vector of a test utterance, we compute the cosine-distance score:

$$S_{\mathrm{CD}}\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right)=\frac{\left\langle\mathbf{W}^{\top} \mathbf{x}_{s}, \mathbf{W}^{\top} \mathbf{x}_{t}\right\rangle}{\left\|\mathbf{W}^{\top} \mathbf{x}_{s}\right\|\left\|\mathbf{W}^{\top} \mathbf{x}_{t}\right\|}, \qquad S_{\mathrm{CD}}\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right) \in[-1,1]$$

where $\mathbf{W}$ is an LDA projection matrix, which projects $\mathbf{x}$ to a space with less channel variability.

If the score is larger than a threshold $\theta$, we accept the speaker; otherwise, we reject the speaker.
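
A minimal numpy sketch of the LDA-projected cosine scoring (the projection matrix W, the target i-vector x_s, and the test i-vector x_t are assumed given):

```python
import numpy as np

def cosine_score(x_s, x_t, W):
    """Cosine similarity between LDA-projected i-vectors; the score lies in [-1, 1]."""
    a, b = W.T @ x_s, W.T @ x_t
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage: accept the claimed identity if cosine_score(x_s, x_t, W) > theta.
```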

Probabilistic LDA (i-vector/PLDA)

  • Based on i-vectors but performs better than i-vectors alone for SV

  • The method is called iVector-PLDA or i-Vector/PLDA

  • The method assumes that there is a speaker subspace V within the i-vector space

  • The i-vector $x_s$ is written as:

    • $x_s = m + Vz_s + \mathcal{E}_{s}$
    • where
      • $x_s$ is the i-vector extracted from the utterance of speaker $s$
      • $m$ is the global mean of all i-vectors
      • $V$ defines the speaker subspace
      • $z_s$ is the speaker factor
      • $\mathcal{E}_{s}$ is the residual noise with covariance $\Sigma$

X-Vector Speaker Recognition (Deep Speaker Embedding)

  • Speaker embedding aims to represent the characteristics of a speaker from a variable-length utterance.

img

  • The TDNN can be replaced by other modules such as ResNet and DenseNet.
  • Statistical pooling aims to aggregate the frame-level information into segment-level information.
    • In the x-vector network, it converts a C×T matrix to a 2C-dim vector
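
A minimal PyTorch sketch of statistics pooling: the mean and standard deviation over the T frames are concatenated to form the segment-level vector:

```python
import torch

def statistics_pooling(h):
    """
    h: (batch, C, T) frame-level features from the TDNN/ResNet encoder.
    Returns (batch, 2C): per-channel mean and standard deviation over time.
    """
    return torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)

h = torch.randn(4, 512, 300)        # e.g., 512 channels, 300 frames
print(statistics_pooling(h).shape)  # torch.Size([4, 1024])
```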

Performance Measures in Speaker Verification

  • Given a speaker verification system and a set of test utterances from some true-speakers and some impostors, we can obtain many true-speaker scores and impostor scores.
  • Then, we plot their histograms or approximate them by Gaussian PDFs
  • The PDF plots lead to three commonly used measures: EER, minDCF, and the DET curve

Equal error rate (EER):

  • We adjust the decision threshold $\theta$ until the false acceptance rate (FAR) is equal to the false rejection rate (FRR)
  • False acceptance: an impostor is classified as a true-speaker
  • False rejection: a true-speaker is classified as an impostor

img

img

Detection cost function (DCF) is a weighted sum of the FAR and FRR:

$$C_{\mathrm{DET}}(\theta)=0.1 \times \mathrm{FRR}(\theta)+0.99 \times \mathrm{FAR}(\theta)$$

A larger weight on FAR means that false acceptance errors are considered more serious.

The minimum DCF (minDCF) is obtained by varying $\theta$ until $C_{\mathrm{DET}}(\theta)$ is minimized:

$$C_{\mathrm{DET}}\left(\theta_{\min}\right)=0.1 \times \mathrm{FRR}\left(\theta_{\min}\right)+0.99 \times \mathrm{FAR}\left(\theta_{\min}\right)$$

Because of the large weight on FAR, the minDCF occurs at a small FAR.
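
A minimal numpy sketch of estimating the EER and the minimum DCF from arrays of true-speaker and impostor scores (the score distributions below are hypothetical):

```python
import numpy as np

def eer_and_min_dcf(true_scores, imp_scores, c_frr=0.1, c_far=0.99):
    thresholds = np.sort(np.concatenate([true_scores, imp_scores]))
    frr = np.array([(true_scores < t).mean() for t in thresholds])   # true speakers rejected
    far = np.array([(imp_scores >= t).mean() for t in thresholds])   # impostors accepted
    eer_idx = np.argmin(np.abs(frr - far))                           # threshold where FAR ~ FRR
    eer = (frr[eer_idx] + far[eer_idx]) / 2
    min_dcf = np.min(c_frr * frr + c_far * far)                      # minimum of the weighted sum
    return eer, min_dcf

rng = np.random.default_rng(0)
true_scores = rng.normal(2.0, 1.0, 1000)     # hypothetical true-speaker scores
imp_scores = rng.normal(-2.0, 1.0, 10000)    # hypothetical impostor scores
print(eer_and_min_dcf(true_scores, imp_scores))
```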

Voice Cloning

What is Voice Cloning?

  • Given an utterance of a target speaker, current technologies enable us to synthesize (text-to-speech) the speech of the target speaker
  • Voice cloning makes use of text-to-speech and speaker recognition technologies

img

  • The speaker encoder is a speaker-embedding network
  • Concatenating the speaker-embedding vector of the target speaker with the encoder’s output vector makes the concatenated vector comprise both phonetic and speaker characteristics

Speaker Recognition: Tutorial

Q1

Q1 GMM-UBM, i-vectors, and i-vector/PLDA are three common frameworks for text-independent speaker verification. All of them rely on Gaussian mixture models (GMMs) instead of hidden Markov models (HMMs). Explain why GMMs are more appropriate than HMMs for text-independent speaker verification.

In text-independent speaker verification, there is no restriction on what the speakers will speak.

As a result, during both the training and recognition stages, it is impossible to compose a large HMM from a number of phone-based HMMs as in speech recognition. Using a large-vocabulary speech recognizer to find the phone sequence of an utterance is not a good option because the phone sequence may be incorrect.

It turns out that using the delta cepstrum and delta-delta cepstrum to capture the short-term dynamics of MFCCs is better than using the possibly incorrect phone sequences obtained from an HMM-based speech recognizer.

Also, training of speaker-dependent phone HMMs requires each target-speaker to provide utterances covering all types of phonetic events, which is more demanding than training a speaker-dependent GMM.

Q2

Fig. Q2 shows the distributions of speaker scores and impostor scores in two speaker verification systems: System A and System B. The score distributions of System A and System B are depicted by the solid and dashed curves, respectively.

img

Q2i)

Q2i) Based on the distributions in Fig. Q2, plot the corresponding false-acceptance rate (FAR) and false-rejection rate (FRR) against the decision threshold θ for both System A and System B. Label the curves corresponding to System A and System B. Indicate on the decision-threshold axis the position of θ = –1.

img

Q2ii)

Q2ii) Hence, plot the detection-error-tradeoff (DET) curves for both systems. Label the curves corresponding to System A and System B.

img

Q3

Q3 Why does the total variability matrix T in i-vector speaker verification define both the speaker and session (channel) variability?

The total variability matrix T is trained by using the utterances of many speakers without using the speaker labels (unsupervised), like the training of PCA projection matrices.

If each training speaker provides many utterances coming from different channels or acoustic environments, the matrix $\mathbf{T}\mathbf{T}^{\top}$ defines not only the covariance of the speaker characteristics but also the covariance of the channels in the GMM-supervector space. In other words, the columns of $\mathbf{T}$ define the space in which the GMM-supervectors can vary, and the variability can be due to speaker variation or channel variation.

Q4

Q4 Why is it important to perform LDA before computing the cosine distance scores in i-vector-based speaker verification?

Since the i-vectors contain both speaker and channel characteristics, it is important to suppress the channel characteristics by projecting the i-vectors to an even lower dimensional subspace, in which variation due to channel effects is minimal.

So the cosine distance reflects the difference in speakers rather than difference in channels.

img

Q5

Q5 In GMM-UBM and GMM-SVM speaker verification, given a sequence of acoustic vectors $\mathcal{X}^{(s)}$ from a client speaker $s$, maximum a posteriori (MAP) adaptation is used for adapting the universal background model (UBM) to create the speaker-dependent Gaussian mixture model (GMM). Typically, only the mean vectors of the UBM are adapted:

$$\boldsymbol{\mu}_{j}^{(s)}=\alpha_{j} E_{j}\left(\mathcal{X}^{(s)}\right)+\left(1-\alpha_{j}\right) \boldsymbol{\mu}_{j}^{\mathrm{ubm}}, \quad j=1, \ldots, M$$

where $M$ is the number of mixture components in the UBM, $E_{j}\left(\mathcal{X}^{(s)}\right)$ is the sufficient statistic depending on $\mathcal{X}^{(s)}$, and $\boldsymbol{\mu}_{j}^{\mathrm{ubm}}$ and $\boldsymbol{\mu}_{j}^{(s)}$ are the $j$-th mean vectors of the UBM and the adapted GMM, respectively.

Q5i

**(i) Discuss the value of $\alpha_{j}$ when the enrollment utterance is very long and when the enrollment utterance is very short.**

  • When the no. of vectors in $\mathcal{X}^{(s)}$ is small, $\alpha_{j} \approx 0$, so $\boldsymbol{\mu}_{j}^{(s)} \approx \boldsymbol{\mu}_{j}^{\mathrm{ubm}}$

  • When the no. of vectors in $\mathcal{X}^{(s)}$ is large, $\alpha_{j} \approx 1$, so $\boldsymbol{\mu}_{j}^{(s)} \approx E_{j}\left(\mathcal{X}^{(s)}\right)$ = the mean of the $j$-th mixture estimated from the speaker's data

When the enrollment utterance is very long, $\alpha_j \rightarrow 1$, so the GMM means depend almost entirely on the utterance.

This is reasonable because there are many acoustic vectors in $\mathcal{X}^{(s)}$, so the statistics estimated from the speaker's own data are reliable.

When the utterance is very short, $\alpha_j \rightarrow 0$, so the GMM means are almost the same as the UBM's means.

This is reasonable because when there are not many acoustic vectors in $\mathcal{X}^{(s)}$, we had better believe the prior, i.e., the UBM means.

Q5ii

(ii) In GMM-SVM speaker verification, we stack the mean vectors $\boldsymbol{\mu}_{j}^{(s)}$ for $j=1, \ldots, M$ to construct a speaker-dependent supervector:

$$\overrightarrow{\boldsymbol{\mu}}^{(s)}=\left[\left(\boldsymbol{\mu}_{1}^{(s)}\right)^{\top} \; \cdots \; \left(\boldsymbol{\mu}_{M}^{(s)}\right)^{\top}\right]^{\top}$$

**Why is it important to use MAP instead of directly applying the EM algorithm to compute the $\boldsymbol{\mu}_{j}^{(s)}$'s when constructing $\overrightarrow{\boldsymbol{\mu}}^{(s)}$?**

We cannot directly apply EM to compute the $\boldsymbol{\mu}_{j}^{(s)}$'s because the EM algorithm gives no guarantee on the index arrangement of the mixture components.

This means that if we apply EM independently to individual speakers when computing their supervectors, the one-to-one correspondence between the subvectors $\boldsymbol{\mu}_{j}^{(s)}$ in $\overrightarrow{\boldsymbol{\mu}}^{(s)}$ will be lost across different target speakers. This one-to-one correspondence, however, is guaranteed in MAP adaptation because each $\boldsymbol{\mu}_{j}^{(s)}$ is adapted from the corresponding UBM mixture $j$.

Q6

Q6a

Q6a) Explain why the dimension of an i-vector is independent of the duration of its corresponding utterance. You may use the factor analysis model of the GMM-supervectors and the formulation of i-vectors to answer this question.

The i-vector of an utterance is the posterior mean of the latent factor of a factor analysis model:

$$\mu=\mu^{(b)}+\mathbf{T} \mathbf{w},$$

where $\mathbf{T}$ is the total variability matrix and $\mathbf{w}$ is the latent factor.

Given an utterance with $T$ acoustic vectors $\mathcal{O}=\left\{\mathbf{o}_{1}, \ldots, \mathbf{o}_{T}\right\}$, the corresponding i-vector is given by

$$\mathbf{x} \equiv\left\langle\mathbf{w} \mid \mathcal{O}\right\rangle=\mathbf{L}^{-1} \sum_{c=1}^{C} \mathbf{T}_{c}^{\top}\left(\boldsymbol{\Sigma}_{c}^{(b)}\right)^{-1} \tilde{\mathbf{f}}_{c}$$

where

$$\mathbf{L}=\mathbf{I}+\sum_{c=1}^{C} N_{c} \mathbf{T}_{c}^{\top}\left(\boldsymbol{\Sigma}_{c}^{(b)}\right)^{-1} \mathbf{T}_{c}, \qquad N_{c} \equiv \sum_{t=1}^{T} \gamma\left(\ell_{t, c}\right), \qquad \tilde{\mathbf{f}}_{c} \equiv \sum_{t=1}^{T} \gamma\left(\ell_{t, c}\right)\left(\mathbf{o}_{t}-\boldsymbol{\mu}_{c}^{(b)}\right)$$

and $\gamma\left(\ell_{t, c}\right)$ is the posterior probability of the $c$-th mixture of the UBM at frame $t$.

In this set of equations, the number of acoustic vectors, $T$, depends on the utterance length.

However, because $\tilde{\mathbf{f}}_{c}$ is obtained by summing over all frames (from $t=1$ to $t=T$), the dimension of $\tilde{\mathbf{f}}_{c}$ is fixed, and so is the dimension of $\mathbf{L}$. Therefore, the dimension of $\mathbf{x}$ is independent of $T$.
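
A minimal numpy sketch of these equations (shapes are assumptions: C mixtures, F-dim acoustic vectors, D-dim i-vectors), which makes the point explicit: the returned vector has dimension D no matter how many frames T the utterance contains:

```python
import numpy as np

def extract_ivector(O, gamma, T_mat, ubm_means, ubm_vars):
    """
    O:         (T, F)    acoustic vectors of one utterance
    gamma:     (T, C)    UBM mixture posteriors for each frame
    T_mat:     (C, F, D) total variability matrix, split per mixture
    ubm_means: (C, F)    UBM mean vectors
    ubm_vars:  (C, F)    diagonal UBM covariances
    returns:   (D,)      posterior mean of w, i.e., the i-vector
    """
    C, F, D = T_mat.shape
    N = gamma.sum(axis=0)                             # N_c: zeroth-order statistics, shape (C,)
    f = gamma.T @ O - N[:, None] * ubm_means          # f~_c: centered first-order statistics, (C, F)
    L = np.eye(D)
    rhs = np.zeros(D)
    for c in range(C):
        Tc_invS = T_mat[c].T / ubm_vars[c]            # T_c^T Sigma_c^{-1}, shape (D, F)
        L += N[c] * Tc_invS @ T_mat[c]                # accumulate L
        rhs += Tc_invS @ f[c]                         # accumulate the projected statistics
    return np.linalg.solve(L, rhs)                    # x = L^{-1} (sum over mixtures), dimension D
```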

Q6b

Q6b) In biometric authentication, when the false acceptance rate (FAR) rises, the false rejection rate (FRR) will drop. Explain why there is always a tradeoff between FAR and FRR. How would you set the decision threshold if security is a major concern?

When the decision threshold is small, both true-speakers and impostors will be accepted by the system, which causes a high false acceptance rate but a low false rejection rate.

When the decision threshold is large, both true-speakers and impostors will be rejected by the system, which causes a low false acceptance rate but a high false rejection rate.

That is why there is a tradeoff between FAR and FRR. If security is a major concern, the decision threshold should be set very high so that the FAR is very low.

Q6c

Acoustic modeling plays an important role in large vocabulary continuous speech recognition (LVCSR).

Q6c(i) Explain the purpose of the acoustic models in LVCSR. Describe how the phone-specific acoustic models can be used for acoustic modeling if you are given a short utterance comprising a few words.

The phone-specific acoustic models in LVCSR are used to compute the likelihood of acoustic vector sequences given various phonetic units (phones or tri-phones).

Given a hypothesized word sequence, we use a dictionary to find the corresponding phone sequence. An acoustic model for the whole utterance is then formed by joining the phone-specific acoustic models corresponding to the phones in the sequence. Finally, the acoustic vectors are aligned to the internal states of the joined acoustic model, from which the likelihood of the whole acoustic vector sequence can be computed.

Q6c(ii) GMM-HMM and DNN-HMM are two of the common acoustic models for LVCSR. State the advantages of these machine learning models for acoustic modeling.

GMM-HMM: The states (which are GMMs) can be easily shared across different phone-specific HMMs. It requires less data to train when compared with DNN-HMM.

DNN-HMM: It uses discriminative training to determine the weights of the DNN, which produces more accurate posterior probabilities of phones. It performs better than GMM-HMM provided that sufficient data are available for training the DNN. The input to the DNN contains multiple contextual frames instead of a single frame as in GMM, which allows the DNN to capture the dynamics of the acoustic vectors.

Extra

2(a) Discuss the situations in which linear support vector machines (SVMs) are more appropriate than nonlinear SVMs for classification.

  • When the feature dimension > the number of training samples, linear SVMs could perfectly separate the two classes.
    • Therefore, using linear SVMs has less chance of overfitting the data.
  • Another situation is when there are only one or two training samples from one class; using a linear SVM is more desirable because the chance of overfitting the minority class is smaller.

2(b) Assume that you are given 100 speech recordings with unknown genders and that you want to identify the genders of these recordings. Assume also that you have a database comprising the speech recordings of 2,500 male and 2,500 female speakers. Each recording is represented by a 500-dimensional i-vector. You use the 5,000 i-vectors extracted from the database to train a linear support vector machine (SVM) to classify their genders. After training, you apply the SVM to identify the gender of the 100 recordings with unknown genders. Discuss the appearance in terms of feminineness and masculinity of the support vectors on

2b(i) the correct side of the decision boundary.

  • These support vectors correspond to the persons whose voice can hardly tell their gender, i.e., their voice is confusing in terms of feminineness and masculinity. However, the SVM is still able to identify their gender correctly.

2b(ii) the wrong side of the decision boundary.

  • These support vectors correspond to the persons whose voice sounds more like the opposite gender than their own gender. These vectors make the gender identification problem non-linearly separable, so the linear SVM needs to use non-zero slack variables ($\xi_i > 1$) to handle them.

Q5 In i-vector based speaker verification, the dimension of the GMM-supervector $\overrightarrow{\boldsymbol{\mu}}_{s}$ corresponding to Speaker $s$ is reduced by the following factor analysis model:

$$\overrightarrow{\boldsymbol{\mu}}_{s}=\overrightarrow{\boldsymbol{\mu}}+\mathbf{T} \mathbf{w}_{s}$$

where $\mathbf{T}$ is a low-rank matrix called the total variability matrix, $\mathbf{w}_{s}$ is the speaker factor, and $\overrightarrow{\boldsymbol{\mu}}$ is a mean supervector formed by stacking the mean vectors of a universal background model.

Q5(a) Explain why this factor analysis model can reduce the dimension of $\overrightarrow{\boldsymbol{\mu}}_{s}$.

Because the matrix $\mathbf{T}$ is rectangular (low-rank) with fewer columns than rows, the dimension of $\mathbf{w}_s$ is smaller than that of $\overrightarrow{\boldsymbol{\mu}}_{s}$.

Q5(b) The i-vector $\mathbf{x}_{s}$ of Speaker $s$ is the posterior mean of $\mathbf{w}_{s}$, which can be obtained from $\mathbf{T}$ and the acoustic vectors derived from his/her utterance. The i-vector $\mathbf{x}_{c}$ of a claimant is obtained in the same manner. During a verification session, we are given the i-vectors of Speaker $s$ and Claimant $c$. A naive way is to accept Claimant $c$ if the cosine-distance score is larger than a decision threshold $\eta$, i.e.,

$$S_{\text{cosine}}\left(\mathbf{x}_{s}, \mathbf{x}_{c}\right)=\frac{\mathbf{x}_{s}^{\top} \mathbf{x}_{c}}{\left\|\mathbf{x}_{s}\right\|\left\|\mathbf{x}_{c}\right\|}>\eta$$

where ${}^{\top}$ and $\|\cdot\|$ denote vector transpose and vector norm, respectively. Explain why this naive approach is undesirable for speaker verification. Suggest a method to pre-process the i-vectors to remedy this undesirable situation.

Because $\mathbf{T}$ defines the total variability space, the covariance matrix $\mathbf{T}\mathbf{T}^{\top}$ models the channel variability as well as the speaker variability. Therefore, the i-vectors contain not only speaker information but also channel information. The cosine distance scores will therefore be affected by the channel information, causing poor verification performance.

The problem can be addressed by applying LDA or LDA+WCCN to the i-vectors before computing the cosine distance scores. These pre-processing methods suppress within-speaker variability and emphasize between-speaker variability, which has the effect of mitigating the channel effect on the cosine-distance scores.