Radial Basis Function Network (RBFN)

Radial basis function networks (RBFNs) represent a special category of the feedforward neural network architecture.

Used in:

  • nonlinear mapping of complex processes,
  • solving a wide range of classification problems,
  • control systems, audio and video signal processing, and pattern recognition,
  • and, more recently, chaotic time-series prediction, with particular application to weather and power-load forecasting.

The basic RBFN structure consists of:

  • an input layer,
  • a single hidden layer with radial activation function,
  • and an output layer.

Generally, RBF networks have an undesirably high number of hidden nodes, but the dimension of the space can be reduced by careful planning of the network.

img

Key features of RBFN:

  • No backpropagation, unlike MLP-type neural networks
    • training instead uses the hybrid approach described below
  • The RBF kernel is nonlinear
    • nonlinear transformations in its hidden layer
    • fixed input weights, unlike MLP-type neural networks
      • the connection weights between the input layer and the neuron units of the hidden layer of an RBFN are all equal to unity
  • No nonlinear activation function at the output
    • linear transformation between the hidden and output layers
    • The rationale behind this is that input spaces, cast nonlinearly into high-dimensional domains, are more likely to be linearly separable than those cast into low-dimensional ones.
  • Unsupervised initialization, supervised training

Radial Function / Kernel Function

In general, the form taken by an RBF is given as:

$$g_i(x) = r_i\left(\frac{\| x - v_i\|}{\sigma_i}\right)$$

  • where $x$ is the input vector,
  • $v_i$ is the vector denoting the center of the radial function $g_i$,
  • $\sigma_i$ is the width parameter.

Gaussian kernel function

The Gaussian kernel function is the most widely used form of RBF given by:

$$g_i(x) = \exp\left(\frac{- \|x-v_i\|^2}{2\sigma^2_i}\right)$$

Logistic function

The logistic function has also been used as a possible RBF candidate:

$$g_i(x) = \frac{1}{1+\exp\left(\frac{\|x-v_i\|^2}{\sigma^2_i}\right)}$$
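
For concreteness, here is a minimal NumPy sketch of the two kernels above; the function names and example values are illustrative assumptions, not part of the original text:

```python
import numpy as np

def gaussian_rbf(x, v, sigma):
    """Gaussian RBF: g_i(x) = exp(-||x - v||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - v) ** 2) / (2.0 * sigma ** 2))

def logistic_rbf(x, v, sigma):
    """Logistic RBF: g_i(x) = 1 / (1 + exp(||x - v||^2 / sigma^2))."""
    return 1.0 / (1.0 + np.exp(np.sum((x - v) ** 2) / sigma ** 2))

x = np.array([1.0, 2.0])    # input vector
v = np.array([0.5, 1.5])    # center of the radial unit
print(gaussian_rbf(x, v, sigma=1.0), logistic_rbf(x, v, sigma=1.0))
```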

Learning Algorithm of RBFN

The standard technique used to train an RBF network is the hybrid approach.

Hybrid Approach

  • First, train the RBF layer to obtain the centers and scaling parameters using unsupervised training.
    • initialization: determine the centers of the RBF units using, e.g.:
      • the K-means method,
      • the maximum-likelihood-estimate technique,
      • the self-organizing map method.
    • accurate knowledge of $v_i$ and $\sigma_i$ has a major impact on the performance of the network.
  • Then, adapt the weights of the output layer using a supervised training algorithm.
    • training: update the weights between the hidden layer and the output layer using, e.g.:
      • the least-squares method,
      • the gradient method.

Because the weights exist only between the hidden layer and the output layer, it is easy to compute the weight matrix for the RBFN.
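
A rough sketch of this hybrid procedure, assuming Gaussian hidden units, K-means for the centers, and least squares for the output weights (scikit-learn's KMeans is used for convenience; all names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbfn(X, y, n_hidden, sigma):
    """Hybrid RBFN training: unsupervised centers, then supervised linear output weights."""
    # Stage 1 (unsupervised): place the centers v_i with K-means.
    centers = KMeans(n_clusters=n_hidden, n_init=10).fit(X).cluster_centers_
    # Hidden-layer activations: Gaussian kernel of the distance to each center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2.0 * sigma ** 2))
    # Stage 2 (supervised): solve the linear output layer by least squares.
    w, *_ = np.linalg.lstsq(G, y, rcond=None)
    return centers, w

def predict_rbfn(X, centers, w, sigma):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ w

# Toy usage: fit a 1-D nonlinear mapping.
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(X).ravel()
centers, w = train_rbfn(X, y, n_hidden=10, sigma=0.5)
print(predict_rbfn(X[:5], centers, w, sigma=0.5))
```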

Advantages/Disadvantages of RBFN

  • +ve: an RBFN trains faster than an MLP.
  • +ve: the hidden layer of an RBFN is easier to interpret than the hidden layer of an MLP.
  • -ve: the unsupervised learning stage of an RBFN is not an easy task.
  • -ve: inference with an RBFN is typically slower than with an MLP.

Kohonen’s Self-Organizing Network

Used in:

  • Speech recognition, vector coding, robotics applications, and texture segmentation.

  • Belongs to the class of unsupervised learning networks.

    • This means that the network, unlike supervised learning based networks, updates its weighting parameters without the need for performance feedback from a teacher or a network trainer.
img

Key features of KSON:

  • The nodes distribute themselves across the input space to recognize groups of similar input vectors.
  • The output nodes compete among themselves to be fired one at a time in response to a particular input vector.
  • The nodes of the KSON can recognize groups of similar input vectors.
    • Two input vectors with similar pattern characteristics excite two physically close output-layer nodes.
  • Unsupervised learning.

Competitive Learning

It is based on the competitive learning technique also known as the winner takes all strategy.

  • Presume that the input pattern is given by the vector $x$.
  • Assume $w_{ij}$ is the weight vector connecting the input elements to an output node with coordinates given by the indices $i$ and $j$.
  • $N_c$ is defined as the neighborhood around the winning output candidate.
  • Its size decreases at every iteration of the algorithm until convergence occurs.

Initialization stage:

  • Initialize all weights to small random values.
  • Set a value for the initial learning rate $\alpha$.
  • Set a value for the neighborhood $N_c$.

Training stage:

  • Choose an input pattern $x$ from the input data set.
  • Select the winning unit $c$ (the index of the best matching output unit) such that the performance index $I$, given by the Euclidean distance from $x$ to $w_{ij}$, is minimized:
    • $I = \| x - w_c \| = \min_{ij}\|x - w_{ij}\|$
  • Then update the weights according to the global network updating phase from iteration $k$ to iteration $k+1$ as
    • $w_{ij}(k+1) = w_{ij}(k) + \alpha(k)[x - w_{ij}(k)]$ if $(i,j) \in N_c(k)$,
    • otherwise do not update.
      • $\alpha(k)$ is the adaptive learning rate (a strictly positive value smaller than unity),
      • $N_c(k)$ is the neighbourhood of the unit $c$ at iteration $k$.
  • The learning rate and the neighborhood are decreased at every iteration according to an appropriate scheme.
    • For the learning rate, e.g. a shrinking function $\alpha(k) = \alpha(0)(1 - k/T)$
      • $T$ is the total number of training cycles,
      • $\alpha(0)$ is the initial learning rate.
    • For the neighbourhood
      • start with an initial region the size of half of the output grid and shrink it according to an exponentially decaying behaviour.
  • The learning scheme continues until a sufficient number of iterations has been reached or until each output reaches a threshold of sensitivity to a portion of the input space, as sketched in the code below.
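
A minimal sketch of this winner-takes-all scheme with $N_c = 0$ (only the winner is updated) and the linearly shrinking learning rate $\alpha(k) = \alpha(0)(1 - k/T)$; the function name and defaults are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def train_kson(X, n_nodes, alpha0=0.3, T=100, seed=0):
    """Competitive (winner-takes-all) training of a KSON with neighbourhood N_c = 0."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.0, 0.5, size=(n_nodes, X.shape[1]))   # small random initial weights
    for k in range(T):
        alpha = alpha0 * (1.0 - k / T)                      # shrinking learning rate alpha(k)
        for x in X:
            c = np.argmin(np.linalg.norm(x - W, axis=1))    # winning unit: closest weight vector
            W[c] += alpha * (x - W[c])                      # update only the winner
    return W

# Toy usage: two well-separated clusters should pull the two nodes toward their centers.
X = np.vstack([np.random.randn(50, 2) + 3.0, np.random.randn(50, 2) - 3.0])
print(train_kson(X, n_nodes=2))
```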

Competitive Learning of KSON: Example

A Kohonen self-organizing map is used to cluster four vectors given by:

  • (1, 1, 1, 0),
  • (0, 0, 0, 1),
  • (1, 1, 0, 0),
  • (0, 0, 1, 1).

The maximum number of clusters to be formed is $m = 3$.

Suppose the learning rate (geometrically decreasing) is given by:

  • Initial learning rate: $\alpha(0) = 0.3$
  • $\alpha(t+1) = 0.2\,\alpha(t)$

With only three clusters available and the weights of only one cluster updated at each step (i.e., $N_c = 0$), find the weight matrix. Use a single epoch of training.

img

The initial weight matrix has dimension [vector size = 4 × m = 3], with each column holding the weight vector of one cluster:

$$W=\left[\begin{array}{lll} 0.2 & 0.4 & 0.1 \\ 0.3 & 0.2 & 0.2 \\ 0.5 & 0.3 & 0.5 \\ 0.1 & 0.1 & 0.1 \end{array}\right]$$

Initial radius: $N_c = 0$
Initial learning rate: $\alpha(0) = 0.3$

For Pattern 1: the first input vector $x = (1, 1, 1, 0)$

$$\begin{aligned}
I(1) &= (1-0.2)^2 + (1-0.3)^2 + (1-0.5)^2 + (0-0.1)^2 = 1.39\\
I(2) &= (1-0.4)^2 + (1-0.2)^2 + (1-0.3)^2 + (0-0.1)^2 = 1.5\\
I(3) &= (1-0.1)^2 + (1-0.2)^2 + (1-0.5)^2 + (0-0.1)^2 = 1.71
\end{aligned}$$

  • The input vector is closest to output node 1. Thus node 1 is the winner. The weights for node 1 should be updated.

$$\begin{aligned}
w^{\text{new}}_{(1)} &= w^{\text{old}}_{(1)}+\alpha\left(x-w^{\text{old}}_{(1)}\right) \\
&= (0.2,0.3,0.5,0.1)+0.3(1-0.2,\,1-0.3,\,1-0.5,\,0-0.1) \\
&= (0.2,0.3,0.5,0.1)+0.3(0.8,\,0.7,\,0.5,\,-0.1) \\
&= (0.44,0.51,0.65,0.07)
\end{aligned}$$

$$W=\left[\begin{array}{lll} 0.44 & 0.4 & 0.1 \\ 0.51 & 0.2 & 0.2 \\ 0.65 & 0.3 & 0.5 \\ 0.07 & 0.1 & 0.1 \end{array}\right]$$

For Pattern 2: the second input vector $x = (0, 0, 0, 1)$

Note that it uses the updated weights.

$$\begin{aligned}
I(1) &= (0-0.44)^2 + (0-0.51)^2 + (0-0.65)^2 + (1-0.07)^2 = 1.7411\\
I(2) &= (0-0.4)^2 + (0-0.2)^2 + (0-0.3)^2 + (1-0.1)^2 = 1.1 \\
I(3) &= (0-0.1)^2 + (0-0.2)^2 + (0-0.5)^2 + (1-0.1)^2 = 1.11
\end{aligned}$$

  • The input vector is closest to output node 2. Thus node 2 is the winner. The weights for node 2 should be updated.

$$\begin{aligned}
w^{\text{new}}_{(2)} &= w^{\text{old}}_{(2)}+\alpha\left(x-w^{\text{old}}_{(2)}\right) \\
&= (0.4,0.2,0.3,0.1)+0.3(0-0.4,\,0-0.2,\,0-0.3,\,1-0.1) \\
&= (0.4,0.2,0.3,0.1)+0.3(-0.4,\,-0.2,\,-0.3,\,0.9) \\
&= (0.28,0.14,0.21,0.37)
\end{aligned}$$

$$W=\left[\begin{array}{lll} 0.44 & 0.28 & 0.1 \\ 0.51 & 0.14 & 0.2 \\ 0.65 & 0.21 & 0.5 \\ 0.07 & 0.37 & 0.1 \end{array}\right]$$

For Pattern 3: the third input vector $x = (1, 1, 0, 0)$

Note that it uses the updated weights.

$$\begin{aligned}
I(1) &= (1-0.44)^2 + (1-0.51)^2 + (0-0.65)^2 + (0-0.07)^2 = 0.9811\\
I(2) &= (1-0.28)^2 + (1-0.14)^2 + (0-0.21)^2 + (0-0.37)^2 = 1.439\\
I(3) &= (1-0.1)^2 + (1-0.2)^2 + (0-0.5)^2 + (0-0.1)^2 = 1.71
\end{aligned}$$

The input vector is closest to output node 1. Thus node 1 is the winner. The weights for node 1 should be updated.

$$\begin{aligned}
w^{\text{new}}_{(1)} &= w^{\text{old}}_{(1)}+\alpha\left(x-w^{\text{old}}_{(1)}\right) \\
&= (0.44,0.51,0.65,0.07)+0.3(1-0.44,\,1-0.51,\,0-0.65,\,0-0.07) \\
&= (0.44,0.51,0.65,0.07)+0.3(0.56,\,0.49,\,-0.65,\,-0.07) \\
&= (0.608,0.657,0.455,0.049)
\end{aligned}$$

$$W=\left[\begin{array}{lll} 0.608 & 0.28 & 0.1 \\ 0.657 & 0.14 & 0.2 \\ 0.455 & 0.21 & 0.5 \\ 0.049 & 0.37 & 0.1 \end{array}\right]$$

For Pattern 4: the fourth input vector $x = (0, 0, 1, 1)$

Note that it uses the updated weights.

$$\begin{aligned}
I(1) &= (0-0.608)^2 + (0-0.657)^2 + (1-0.455)^2 + (1-0.049)^2 = 2.00\\
I(2) &= (0-0.28)^2 + (0-0.14)^2 + (1-0.21)^2 + (1-0.37)^2 = 1.119\\
I(3) &= (0-0.1)^2 + (0-0.2)^2 + (1-0.5)^2 + (1-0.1)^2 = 1.11
\end{aligned}$$

  • The input vector is closest to output node 3. Thus node 3 is the winner. The weights for node 3 should be updated.

$$\begin{aligned}
w^{\text{new}}_{(3)} &= w^{\text{old}}_{(3)}+\alpha\left(x-w^{\text{old}}_{(3)}\right) \\
&= (0.1,0.2,0.5,0.1)+0.3(0-0.1,\,0-0.2,\,1-0.5,\,1-0.1) \\
&= (0.1,0.2,0.5,0.1)+0.3(-0.1,\,-0.2,\,0.5,\,0.9) \\
&= (0.07,0.14,0.65,0.37)
\end{aligned}$$

$$W=\left[\begin{array}{lll} 0.608 & 0.28 & 0.07 \\ 0.657 & 0.14 & 0.14 \\ 0.455 & 0.21 & 0.65 \\ 0.049 & 0.37 & 0.37 \end{array}\right]$$

Epoch 1 is complete.

Reduce the learning rate:

$$\alpha(t+1) = 0.2\,\alpha(t) = 0.2(0.3) = 0.06$$

Repeat from the start for new epochs until $\Delta w_j$ becomes steady for all input patterns or the error is within a tolerable range.
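
The single training epoch above can be reproduced with the short script below (an illustrative check, not part of the original notes); each column of `W` holds one cluster's weight vector:

```python
import numpy as np

patterns = np.array([[1, 1, 1, 0],
                     [0, 0, 0, 1],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1]], dtype=float)

# Initial 4 x 3 weight matrix: column j is the weight vector of cluster j.
W = np.array([[0.2, 0.4, 0.1],
              [0.3, 0.2, 0.2],
              [0.5, 0.3, 0.5],
              [0.1, 0.1, 0.1]])
alpha = 0.3

for x in patterns:
    I = ((x[:, None] - W) ** 2).sum(axis=0)   # squared distance to each cluster's weights
    c = np.argmin(I)                          # winning cluster (N_c = 0)
    W[:, c] += alpha * (x - W[:, c])          # update the winner only

print(np.round(W, 3))   # matches the epoch-1 weight matrix computed above
alpha *= 0.2            # learning rate for epoch 2: 0.06
```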

Hopfield Network

  • Recurrent Topology

It is the pioneering work of Hopfield in the early 1980s that led the way for the design of neural networks with feedback paths and dynamics.

The work of Hopfield is seen by many as the starting point for the implementation of associative (content addressable) memory by using a special structure of recurrent neural networks.

Used in:

  • Information retrieval and for pattern and speech recognition,
  • Optimization problems,
  • Combinatorial optimization problems such as the traveling salesman problem.
img

Features of the Hopfield Network:

  • the output of each unit can take a binary value (either 0 or 1) or a bipolar value (either -1 or 1).
  • Output value is fed back to all the input units of the network
    • except to the one corresponding to that output.
  • Recurrent Topology
  • Special Activation Function
    • $o_i = \operatorname{sgn}\left(\sum^n_{j=1} w_{ij}o_j - \theta_i\right) = 1$ if $\sum_{j\neq i} w_{ij}o_j > \theta_i$
    • $o_i = \operatorname{sgn}\left(\sum^n_{j=1} w_{ij}o_j - \theta_i\right) = -1$ if $\sum_{j\neq i} w_{ij}o_j < \theta_i$
      • $o_i$ : the output of the current processing unit (Hopfield neuron)
      • $\theta_i$ : threshold value
  • Energy function
    • defined so as to decrease monotonically with variation of the output states until a minimum is attained:
    • $E = -\frac{1}{2}\sum\sum_{i\neq j} w_{ij}o_i o_j + \sum_i o_i \theta_i$
    • the energy function $E$ of the network continues to decrease until it settles by reaching a local minimum.
    • $\Delta E = -\frac{1}{2} \Delta o_i \left(\sum_{j \neq i} w_{ij}o_j - \theta_i\right)$

Hebbian Learning

The learning algorithm for the Hopfield network is based on the so-called Hebbian learning rule.

  • It is based on the idea that when two units are simultaneously activated, their interconnection weight increase becomes proportional to the product of their two activities.
  • This is one of the earliest procedures designed for carrying out supervised learning.

The Hebbian learning rule, also known as the outer product rule of storage, as applied to a set of $q$ presented patterns $p_k$ $(k=1, \ldots, q)$, each with dimension $n$ ($n$ denotes the number of neuron units in the Hopfield network), is expressed as:

$$w_{i j}= \begin{cases}\frac{1}{n} \sum_{k=1}^q p_{k j} p_{k i} & \text { if } i \neq j \\ 0 & \text { if } i=j\end{cases}$$

  • The weight matrix $W=\{w_{ij}\}$ could also be expressed in terms of the outer product of the vectors $p_k$ as:

$$W=\left\{w_{i j}\right\}=\frac{1}{n} \sum_{k=1}^q p_k p_k^T-\frac{q}{n} \mathrm{I}$$
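
A small NumPy sketch of this outer-product storage rule (illustrative; `patterns` is assumed to be a list of bipolar vectors):

```python
import numpy as np

def hebbian_store(patterns):
    """Build the Hopfield weight matrix W = (1/n) * sum_k p_k p_k^T, with zero diagonal."""
    P = np.asarray(patterns, dtype=float)   # q x n matrix of bipolar (+1/-1) patterns
    q, n = P.shape
    W = (P.T @ P) / n                       # (1/n) * sum of outer products p_k p_k^T
    np.fill_diagonal(W, 0.0)                # w_ii = 0, equivalent to subtracting (q/n) I
    return W

# Example: the single fundamental memory used in the worked example below.
print(hebbian_store([[1, 1, 1, -1]]) * 4)   # times n = 4 to drop the 1/n scaling
```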

Learning Algorithm

  • Step 1 (storage):
    • Store the patterns by establishing the connection weights. Each of the $q$ fundamental memories presented is a vector of bipolar elements (+1 or -1).
  • Step 2 (initialization):
    • Present to the network an unknown pattern $u$ with the same dimension as the fundamental patterns.
    • Every component of the network output at the initial iteration cycle is set as $o(0) = u$.
  • Step 3 (retrieval 1):
    • Each component $o_i$ of the output vector $o$ is updated from cycle $l$ to cycle $l+1$ by:
      • $o_i(l+1) = \operatorname{sgn}\left(\sum^n_{j=1} w_{ij}o_j(l)\right)$
      • This process is known as asynchronous updating.
      • The process continues until no more changes are made and convergence occurs (see the retrieval sketch below).
  • Step 4 (retrieval 2):
    • Continue the process for other presented unknown patterns by starting again from Step 2.
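
A minimal sketch of the asynchronous retrieval loop (Steps 2 and 3), assuming zero thresholds and a weight matrix built with the Hebbian rule above; the function name and the probe used in the usage lines are illustrative:

```python
import numpy as np

def hopfield_retrieve(W, u, max_cycles=100):
    """Asynchronously update o until no component changes (zero thresholds assumed)."""
    o = np.asarray(u, dtype=float).copy()    # Step 2: o(0) = u
    for _ in range(max_cycles):
        changed = False
        for i in range(len(o)):              # Step 3: update one neuron at a time
            new_oi = 1.0 if W[i] @ o > 0 else -1.0
            if new_oi != o[i]:
                o[i] = new_oi
                changed = True
        if not changed:                       # convergence: a stable state was reached
            break
    return o

# Usage with the 4-node example below: the probe [-1, -1, 1, -1] (state J)
# settles into the stored fundamental pattern [1, 1, 1, -1] (state B).
W = np.array([[0, 1, 1, -1], [1, 0, 1, -1], [1, 1, 0, -1], [-1, -1, -1, 0]], dtype=float)
print(hopfield_retrieve(W, [-1, -1, 1, -1]))
```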

Example of Hebbian Learning in Hopfield Network

Problem Statement

  • We need to store a fundamental pattern (memory) given by the vector $O = [1, 1, 1, -1]^T$ in a four-node binary Hopfield network.
    • Two potential attractors:
      • the original fundamental pattern $[1, 1, 1, -1]^T$
      • the complement of the original fundamental pattern $[-1, -1, -1, 1]^T$
  • Presume that the threshold parameters are all equal to zero.

Establish Connection Weights:

  • Weight matrix expression, discarding the $\tfrac{1}{n} = \tfrac{1}{4}$ scaling factor and with $q = 1$:

    $$W=\frac{1}{n} \sum_{k=1}^q p_k p_k^T-\frac{q}{n} \mathrm{I} \;\Rightarrow\; W = p_1 p_1^T-\mathrm{I}$$

    • Therefore:

    $$W=\left[\begin{array}{c} 1 \\ 1 \\ 1 \\ -1 \end{array}\right]\left[\begin{array}{llll} 1 & 1 & 1 & -1 \end{array}\right]-\left[\begin{array}{llll} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right]=\left[\begin{array}{cccc} 0 & 1 & 1 & -1 \\ 1 & 0 & 1 & -1 \\ 1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{array}\right]$$

Network’s States and Their Codes

  • Total number of states: there are $2^n = 2^4 = 16$ different states.
img

All thresholds are equal to zero: $\theta_i = 0$, $i = 1, 2, 3, 4$. Therefore, the threshold term vanishes from the energy function.

  • Now we compute the energy level of a state as:

$$\begin{aligned}
E &= -\tfrac{1}{2} \sum_{i=1}^4 \sum_{j=1}^4 w_{i j} o_i o_j \\
&= -\tfrac{1}{2}\left(w_{11} o_1 o_1+w_{12} o_1 o_2+w_{13} o_1 o_3+w_{14} o_1 o_4+\right. \\
&\qquad w_{21} o_2 o_1+w_{22} o_2 o_2+w_{23} o_2 o_3+w_{24} o_2 o_4+ \\
&\qquad w_{31} o_3 o_1+w_{32} o_3 o_2+w_{33} o_3 o_3+w_{34} o_3 o_4+ \\
&\qquad \left.w_{41} o_4 o_1+w_{42} o_4 o_2+w_{43} o_4 o_3+w_{44} o_4 o_4\right)
\end{aligned}$$

  • We need to compute the Energy level of each state.

For example, for state A we have $A = [o_1, o_2, o_3, o_4] = [1, 1, 1, 1]$.

  • We make use of $W$:

$$E_A = -\frac{1}{2}(0+1+1-1+1+0+1-1+1+1+0-1-1-1-1+0) = 0$$

Similarly, we can compute the energy level of the other states.
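
The energy levels of all 16 states can be enumerated with a short script (illustrative, using the $W$ of this example and zero thresholds):

```python
import numpy as np
from itertools import product

W = np.array([[0, 1, 1, -1],
              [1, 0, 1, -1],
              [1, 1, 0, -1],
              [-1, -1, -1, 0]], dtype=float)

# E = -1/2 * sum_ij w_ij o_i o_j   (all thresholds are zero)
for o in product([-1, 1], repeat=4):
    o = np.array(o, dtype=float)
    print(o, -0.5 * o @ W @ o)
```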

img

Retrieval Stage

We update the components of each state asynchronously using equation:

$$o_i = \operatorname{sgn}\left(\sum^n_{j=1} w_{ij}o_j - \theta_i\right)$$

Updating the state asynchronously means that for every state presented we activate one neuron at a time.

  • All states change from high energy to low energy levels.

For example, consider the state transition for state J. We start from $J = [o_1, o_2, o_3, o_4] = [-1, -1, 1, -1]^T$.

Updating $o_1$:

$$\begin{aligned} o_1 & = \operatorname{sgn}(w_{12}o_2 + w_{13}o_3 + w_{14}o_4)\\ & = \operatorname{sgn}((1)(-1) + (1)(1) + (-1)(-1))\\ & = \operatorname{sgn}(+1) = +1 \end{aligned}$$

As a result, the first component of the state J changes from −1 to 1.

In other words, the state J transits to the state G at the end of the first transition.

Now $[o_1, o_2, o_3, o_4] = [1, -1, 1, -1]^T$ => State G

Updating $o_2$:

$$\begin{aligned} o_2 & = \operatorname{sgn}(w_{21}o_1 + w_{23}o_3 + w_{24}o_4)\\ & = \operatorname{sgn}((1)(1) + (1)(1) + (-1)(-1))\\ & = \operatorname{sgn}(+3) = +1 \end{aligned}$$

As a result, the second component of the state G changes from −1 to 1.

In other words, the state G transits to the state B at the end of the second transition.

Now $[o_1, o_2, o_3, o_4] = [1, 1, 1, -1]^T$ => State B

As state B is a fundamental pattern, no more transitions will occur. But let's check:

Updating $o_3$:

$$\begin{aligned} o_3 & = \operatorname{sgn}(w_{31}o_1 + w_{32}o_2 + w_{34}o_4)\\ & = \operatorname{sgn}((1)(1) + (1)(1) + (-1)(-1))\\ & = \operatorname{sgn}(+3) = +1 \end{aligned}$$

As a result, the third component of state B remains 1.

Therefore no transition is observed.

Again as state B is a fundamental pattern, no more transition will occur.

Updating $o_4$:

$$\begin{aligned} o_4 & = \operatorname{sgn}(w_{41}o_1 + w_{42}o_2 + w_{43}o_3)\\ & = \operatorname{sgn}((-1)(1) + (-1)(1) + (-1)(1))\\ & = \operatorname{sgn}(-3) = -1 \end{aligned}$$

As a result, the fourth component of state B remains -1.

Therefore no transition is observed.

img

Limitations of Hopfield Network

  • Limited stable-state storage capacity of the network.
  • Hopfield estimated roughly that a network with $n$ processing units should allow for $0.15\,n$ stable states.
  • Many studies have been carried out recently to increase the capacity of the network without greatly increasing the number of processing units.