XOR Problem
In part 2, the AND gate and the OR gate could be trained using a single perceptron, but the XOR gate could not, as it is not linearly separable. The XOR function can be represented as \begin{align} y &= x_{1}\bar{x_{2}}+\bar{x_{1}}x_{2} \\ z_{1} &= x_{1}\bar{x_{2}} \\ z_{2} &= \bar{x_{1}}x_{2} \\ y &= z_{1}+z_{2} \end{align} Looking at the above equations, \begin{align} z_{1} &= x_{1} \land \bar{x_{2}} \\ z_{2} &= \bar{x_{1}} \land x_{2} \\ y &= z_{1} \lor z_{2} \end{align} y is split into two equations, z1 and z2, each of which is linearly separable. XOR itself cannot be realized using a single neuron, so the solution is to use multiple neurons. The organization of multiple neurons stacked in different layers is known as a "Multi Layer Perceptron". In this part we will see a very basic multi layer perceptron that solves the XOR problem. This chapter involves differential calculus. For information, the basic ideas will be discussed, but no in-depth calculus will be used; it is assumed the reader is aware of derivatives.
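Before building a network, the decomposition itself can be checked with plain boolean operators. The short C program below is only an illustration of the three equations above, not part of the neural network code:
#include <stdio.h>

int main(void)
{
    // print the truth table of XOR built from the two AND terms and an OR
    for(int x1 = 0; x1 <= 1; x1++)
        for(int x2 = 0; x2 <= 1; x2++)
        {
            int z1 = x1 & !x2;   // z1 = x1 AND (NOT x2)
            int z2 = !x1 & x2;   // z2 = (NOT x1) AND x2
            int y  = z1 | z2;    // y  = z1 OR z2, i.e. x1 XOR x2
            printf("%d %d -> %d\n", x1, x2, y);
        }
    return 0;
}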
Multi Layer Perceptron
A Multi Layer Perceptron (MLP) is a network of neurons arranged in more than one layer: an input layer, one or more hidden layers and an output layer. The input layer generally just passes the data through, the hidden layer is where the real magic happens, and the output layer is the interface to the user. A hidden layer is called hidden because the user never interacts with it directly. An MLP may have several hidden layers, but stacking many of them runs into the "Vanishing Gradient Problem"; there is enough material about it on the internet for the interested reader. Here we use an MLP with one input layer, one hidden layer and one output layer, and the network is fully connected, that is, each neuron of a layer is connected to each neuron of the immediately neighbouring layer. Please see the picture below.
One thing to observe here is the hidden layer. Each neuron in this layer has connections from every neuron in the input layer, and is in turn connected to the output neuron. This is a simple fully connected neural network with 4 inputs, 5 hidden neurons and one output neuron. Networks with multiple outputs will be discussed and used in a future part.
The connections between layers carry the weights. These weights are generally represented using matrices. For convenience we will write each weight as wij, where i is the index of the receiving neuron in the next layer and j is the index of the neuron in the current layer. In the above case the weight matrix between the input and hidden layer is Wih[5,4], that is 5 rows and 4 columns, and the weight matrix between the hidden and output layer is Who[1,5]. The input, the hidden layer output and the final output are all column matrices.
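As a quick sketch, the shapes for the network in the picture could be set up with the same matrix helper that the XOR program below uses (create_matrix(rows, columns) from mlp1.c is assumed to behave the way it is used later in this article):
// weight matrix between input (4 neurons) and hidden (5 neurons): 5 rows, 4 columns
matrix wih = create_matrix(5,4);
// weight matrix between hidden (5 neurons) and output (1 neuron): 1 row, 5 columns
matrix who = create_matrix(1,5);
// input, hidden output and final output are column matrices
matrix x = create_matrix(4,1);
matrix h = create_matrix(5,1);
matrix o = create_matrix(1,1);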
Forward Propagation
Input data moves through the network and creates some output; this is called forward propagation. Previously, with a single neuron, the calculation was easy. With multiple layers it is a bit more involved and boils down to matrix multiplication. Consider the image above. The input arriving at the ith neuron of the hidden layer can be expressed as \begin{equation} h_{i} = w_{i0}x_{0}+w_{i1}x_{1}+w_{i2}x_{2}+w_{i3}x_{3} \end{equation} or, in general terms, \begin{equation} h_{i} = \sum_{j=0}^{n}w_{ij}x_{j} \end{equation} If the set of equations for hi is looked at closely, it is exactly the result of a matrix multiplication. $$ \begin{bmatrix} h_{0} \\ h_{1} \\ h_{2} \\ h_{3} \\ h_{4} \end{bmatrix} = \begin{bmatrix} w_{00}^{ih}&w_{01}^{ih}&w_{02}^{ih}&w_{03}^{ih} \\ w_{10}^{ih}&w_{11}^{ih}&w_{12}^{ih}&w_{13}^{ih} \\ w_{20}^{ih}&w_{21}^{ih}&w_{22}^{ih}&w_{23}^{ih} \\ w_{30}^{ih}&w_{31}^{ih}&w_{32}^{ih}&w_{33}^{ih} \\ w_{40}^{ih}&w_{41}^{ih}&w_{42}^{ih}&w_{43}^{ih} \end{bmatrix} \times \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} $$ $$H = W^{ih}\cdot X$$
The activation function of the hidden layer is then applied to this resultant matrix to produce the hidden layer output. That output is in turn the input to the output layer (in our case a single output neuron). As with the previous layer, a matrix multiplication gives the input arriving at the output layer, and applying the output layer's activation function gives the network output. $$ \begin{bmatrix} O_{0} \end{bmatrix} = \begin{bmatrix} w_{0}^{ho}&w_{1}^{ho}&w_{2}^{ho}&w_{3}^{ho}&w_{4}^{ho} \end{bmatrix} \times \begin{bmatrix} h_{0} \\ h_{1} \\ h_{2} \\ h_{3} \\ h_{4} \end{bmatrix} $$ $$ O = W^{ho}\cdot H $$
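Putting the two steps together, a forward pass is just two matrix multiplications, each followed by the activation. The sketch below reuses the helper and variable names that appear in the training snippet later in this article (mat_mult, matrix_map, nn, inp, hidden_out, final_out); the sigmoid function itself is an assumption about mlp1.c, suggested by the sigmoid_prime call used during training.
// H = sigmoid(Wih . X) : hidden layer output
matrix hidden_out = mat_mult(nn.wih, inp);
matrix_map(&hidden_out, sigmoid);   // assumed activation in mlp1.c
// O = sigmoid(Who . H) : network output
matrix final_out = mat_mult(nn.who, hidden_out);
matrix_map(&final_out, sigmoid);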
Back Propagation
The output O depends on the weights Who and on H, and H in turn depends on the weights Wih. Since the initial weights are chosen at random, they are unlikely to be perfect, so the output O differs from the expected or target value. The difference between the target and the output is known as the error. This error has to be compensated for by adjusting the weights Who and Wih. The adjustment starts at the last layer and proceeds back towards the first, hence the name back propagation.
The error has to be distributed among the different layers, depending on how much each layer contributed to it. To find each contribution, the partial derivative of the error with respect to the contributing elements has to be worked out. The full description and derivation are beyond the scope of this article, as they would lengthen it considerably; for reference one can watch this video.
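As a one-line sketch of the idea (assuming a squared error $E=\frac{1}{2}(t-o)^{2}$ and a sigmoid activation, which is what the sigmoid_prime call in the code below suggests), the chain rule gives, for a hidden-to-output weight, \begin{equation} \frac{\partial E}{\partial w_{i}^{ho}} = \frac{\partial E}{\partial o}\cdot\frac{\partial o}{\partial net}\cdot\frac{\partial net}{\partial w_{i}^{ho}} = -(t-o)\,\sigma'(net)\,h_{i} \end{equation} where $net=\sum_{i}w_{i}^{ho}h_{i}$. Moving each weight against its gradient is what produces the update rule below.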
We can represent the updated weight matrix of a layer as follows: $$ W_{new}=W_{cur}+\alpha\,[\text{target}-\text{actual output}]\cdot X^{T}$$ where
- X is the input to the layer (its transpose is used so that the correction has the same shape as W)
- α is the learning rate
Implementation
To adjust the weights in Who, we first find the error at the output, multiply it element-wise by the derivative of the activation function evaluated at the output, and then scale it by the learning rate (α). That completes the element-wise calculations. Multiplying the result by the transpose of the input to this layer, that is the activated output of the hidden layer, gives the delta weights. Adding these delta weights to the weight matrix Who gives the updated hidden-to-output weights.
The same procedure is then repeated for the input-to-hidden weights. First we need the error at the hidden layer, which is found by multiplying (Who)T with the error at the output. The rest of the procedure is the same as above.
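In matrix form, the two updates described above can be summarized as follows (a sketch only; $\odot$ is the element-wise Hadamard product and $\sigma'$ stands for whatever sigmoid_prime computes in mlp1.c, applied, as in the snippet below, to the already activated outputs): \begin{align} E_{out} &= T - O \\ \Delta W^{ho} &= \alpha\,(E_{out}\odot\sigma'(O))\cdot H^{T} \\ E_{hid} &= (W^{ho})^{T}\cdot E_{out} \\ \Delta W^{ih} &= \alpha\,(E_{hid}\odot\sigma'(H))\cdot X^{T} \end{align}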
// find error in output
matrix err_out=subtract_matrix(expect,final_out);
// back-propagate the error to the hidden layer: err_hid = (who)^T . err_out
matrix whoT = transpose(nn.who);
matrix err_hid=mat_mult(whoT,err_out);
// calculate delta errors
matrix_map(&final_out,sigmoid_prime);
mat_mult_hadamard(&err_out,&final_out);
matrix hidT= transpose(hidden_out);
matrix delta_who = mat_mult(err_out,hidT);
mat_mult_scalar(nn.lr,&delta_who);
matrix_map(&hidden_out,sigmoid_prime);
mat_mult_hadamard(&err_hid,&hidden_out);
matrix inpT = transpose(inp);
matrix delta_wih = mat_mult(err_hid,inpT);
mat_mult_scalar(nn.lr,&delta_wih);
// update weights
matrix tempa =add_matrix(nn.who,delta_who);
assign_matrix(&nn.who,&tempa);
matrix tempb = add_matrix(nn.wih,delta_wih);
assign_matrix(&nn.wih,&tempb);
With this we are ready to build a very small neural net and train it on the XOR problem. The MNIST dataset can also be trained, but it needs patience and time (I achieved 91.6% accuracy). The Iris dataset and the wheat seeds dataset are trained very easily. Below is the XOR implementation.
#include <stdio.h>
#include "../mlp1.c"
int main(int argc, char *argv[])
{
srand(0);
nnet my_nn =create_nn(2,3,1);
// training
for(int n = 0; n< 30000; n++)
{
matrix inp = create_matrix(2,1);
matrix expected = create_matrix(1,1);
set_cell(&inp,0,0,1.0);
set_cell(&inp,1,0,0.0);
set_cell(&expected,0,0,1);
train(my_nn,inp,expected);
set_cell(&inp,0,0,1.0);
set_cell(&inp,1,0,1.0);
set_cell(&expected,0,0,0);
train(my_nn,inp,expected);
set_cell(&inp,0,0,0.0);
set_cell(&inp,1,0,1.0);
set_cell(&expected,0,0,1);
train(my_nn,inp,expected);
set_cell(&inp,0,0,0.0);
set_cell(&inp,1,0,0.0);
set_cell(&expected,0,0,0);
train(my_nn,inp,expected);
delete_matrix(&expected);
delete_matrix(&inp);
}
// save the trained network
save_nn(my_nn,"xor2.net");
// load nn from file
my_nn = load_nn("xor2.net");
// test trained network
matrix inp = create_matrix(2,1);
set_cell(&inp,0,0,1.0);
set_cell(&inp,1,0,1.0);
matrix out = predict(my_nn,inp);
display_matrix(out);
delete_matrix(&out);
set_cell(&inp,0,0,0.0);
set_cell(&inp,1,0,0.0);
out = predict(my_nn,inp);
display_matrix(out);
delete_matrix(&out);
set_cell(&inp,0,0,0.0);
set_cell(&inp,1,0,1.0);
out=predict(my_nn,inp);
display_matrix(out);
delete_matrix(&out);
set_cell(&inp,0,0,1.0);
set_cell(&inp,1,0,0.0);
out = predict(my_nn,inp);
display_matrix(out);
delete_matrix(&out);
delete_nn(my_nn);
delete_matrix(&inp);
return 0;
}
=======output====
[ashok@fedora xor]$ ./xor
[ 0.00559855 ]
[ 0.06528170 ]
[ 0.96165163 ]
[ 0.96165793 ]
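To reproduce the run above, the program can be compiled with something along the lines of gcc xor.c -o xor -lm (assuming the source file is named xor.c); the -lm flag is only an assumption, needed if mlp1.c calls exp() from math.h for the sigmoid.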
Though I may put all the code here or on GitHub, I would like the reader to implement his or her own version analogous to this one. For any problem, contact me for the full code, it is free, but before asking please try to implement it yourself.
So far we have discussed
- What is an artificial neuron
- How to implement it
- What are the areas it can be used in (linearly separable problems)
- How to train it
- What is XOR problem
- What is a multi layer neural network
- How to implement one
- How to train it
- Example of XOR implementation