Tuesday 2 May 2023

Neural Network from Scratch Using C (Part-3)

 

XOR Problem

In part 2, the AND gate and the OR gate could be trained using a single perceptron, but the XOR gate could not, as it is not linearly separable. The XOR function can be represented as \begin{align} y &= x_{1}\bar{x_{2}}+\bar{x_{1}}x_{2} \\ z_{1} &= x_{1}\bar{x_{2}} \\ z_{2} &= \bar{x_{1}}x_{2} \\ y &= z_{1}+z_{2} \end{align} Looking at the above equations, \begin{align} z_{1} &= x_{1} \land \bar{x_{2}} \\ z_{2} &= \bar{x_{1}} \land x_{2} \\ y &= z_{1} \lor z_{2} \end{align} y is split into two functions, z1 and z2, each of which is linearly separable, while y itself is not; hence XOR cannot be realized using a single neuron. The solution is to use multiple neurons. The organization of multiple neurons stacked in different layers is known as a "Multi Layer Perceptron". In this part we will build a very basic multi layer perceptron that solves the XOR problem. This chapter involves differential calculus. The basics will be mentioned where needed, but no in-depth calculus will be used; it is assumed the reader is familiar with derivatives.
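
To see the decomposition concretely, here is a tiny stand-alone C check (not part of the network code) that evaluates z1, z2 and y for the four input combinations and reproduces the XOR truth table:

#include <stdio.h>

int main(void)
{
	// y = x1*!x2 + !x1*x2 : OR of two linearly separable terms
	for (int x1 = 0; x1 <= 1; x1++)
	{
		for (int x2 = 0; x2 <= 1; x2++)
		{
			int z1 = x1 & !x2;   // first linearly separable part
			int z2 = !x1 & x2;   // second linearly separable part
			int y  = z1 | z2;    // combine the two parts
			printf("%d %d -> %d\n", x1, x2, y);
		}
	}
	return 0;
}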

Multi layer Perceptron

A Multi Layer Perceptron (MLP) is a network of neurons arranged in more than one layer: input, hidden and output. The input layer is generally a direct connecting layer, the hidden layer is where the real magic happens, and the output layer is the interface to the user. A hidden layer is called so because the user does not interact with it. An MLP may contain multiple hidden layers, but there is a practical limitation due to the "Vanishing Gradient Problem"; if interested, there is plenty of material on the internet. Here we consider an MLP with one input layer, one hidden layer and one output layer, where the network is fully connected, that is, each neuron of a layer is connected to each neuron of the immediately neighbouring layer. Please see the picture below.

One thing to observe here is the hidden layer. Each neuron in this layer has connections from every neuron in the input layer, and each is connected to the output neuron. This is a simple fully connected neural network with 4 inputs, 5 hidden neurons and one output neuron. In future posts, networks with multiple outputs will be discussed and used.

The connections between layers carry weights. These weights are generally represented using matrices. For convenience we will denote each weight as wij, where i is the ith neuron of the next layer and j is the jth neuron of the current layer. In the above case the weight matrix between the input and hidden layers is defined as Wih[5,4], i.e. 5 rows and 4 columns, and the weight matrix between the hidden and output layers is Who[1,5]. The input, the hidden output and the final output are each represented as a column matrix.
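
Putting this into code, a minimal sketch of the network structure could look as below. The field names wih, who and lr match the ones used later (nn.wih, nn.who, nn.lr); the n_inp/n_hid/n_out fields, the constructor name create_nn and the random initialization range are assumptions here, not necessarily how mlp1.c defines them. The matrix type and helpers (create_matrix, set_cell) come from the small matrix library mentioned in part 2.

#include <stdlib.h>   // rand(), RAND_MAX

// sketch only: the real nnet/create_nn in mlp1.c may differ
typedef struct nnet
{
	int n_inp;    // neurons in input layer
	int n_hid;    // neurons in hidden layer
	int n_out;    // neurons in output layer
	matrix wih;   // weights input  -> hidden, [n_hid x n_inp]
	matrix who;   // weights hidden -> output, [n_out x n_hid]
	double lr;    // learning rate
} nnet;

nnet create_nn(int n_inp, int n_hid, int n_out)
{
	nnet nn;
	nn.n_inp = n_inp;
	nn.n_hid = n_hid;
	nn.n_out = n_out;
	nn.lr  = 0.1;                            // learning rate (value assumed)
	nn.wih = create_matrix(n_hid, n_inp);    // weights start as small random values
	nn.who = create_matrix(n_out, n_hid);
	for (int r = 0; r < n_hid; r++)
		for (int c = 0; c < n_inp; c++)
			set_cell(&nn.wih, r, c, (double)rand()/RAND_MAX - 0.5);
	for (int r = 0; r < n_out; r++)
		for (int c = 0; c < n_hid; c++)
			set_cell(&nn.who, r, c, (double)rand()/RAND_MAX - 0.5);
	return nn;
}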

Forward Propagation

Input data moves through the network and produces some output. This is called forward propagation. With a single neuron the calculation was easy; with multiple layers it is a bit more involved, as it requires matrix multiplication. Consider the image above. The input arriving at the ith neuron of the hidden layer can be expressed as: \begin{equation} h_{i} = w_{i0}x_{0}+w_{i1}x_{1}+w_{i2}x_{2}+w_{i3}x_{3} \end{equation} or, in general terms, as \begin{equation} h_{i} = \sum_{j=0}^{n-1}w_{ij}x_{j} \end{equation} where n is the number of inputs. If the set of equations for hi is observed closely, it is seen to be the result of a matrix multiplication. $$ \begin{bmatrix} h_{0} \\ h_{1} \\ h_{2} \\ h_{3} \\ h_{4} \end{bmatrix} = \begin{bmatrix} w_{00}^{ih}&w_{01}^{ih}&w_{02}^{ih}&w_{03}^{ih} \\ w_{10}^{ih}&w_{11}^{ih}&w_{12}^{ih}&w_{13}^{ih} \\ w_{20}^{ih}&w_{21}^{ih}&w_{22}^{ih}&w_{23}^{ih} \\ w_{30}^{ih}&w_{31}^{ih}&w_{32}^{ih}&w_{33}^{ih} \\ w_{40}^{ih}&w_{41}^{ih}&w_{42}^{ih}&w_{43}^{ih} \end{bmatrix} \times \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} $$ $$H = W^{ih}.X$$

The activation function of the hidden layer is then applied to this resultant matrix to produce the hidden layer's output. The values thus generated are the input to the output layer (in our case a single output neuron). As with the previous layer, a matrix multiplication gives the input to the output layer, and applying the output layer's activation function gives the network output. $$ \begin{bmatrix} O_{0} \end{bmatrix} = \begin{bmatrix} w_{0}^{ho}&w_{1}^{ho}&w_{2}^{ho}&w_{3}^{ho}&w_{4}^{ho} \end{bmatrix} \times \begin{bmatrix} h_{0} \\ h_{1} \\ h_{2} \\ h_{3} \\ h_{4} \end{bmatrix} $$ $$ O = W^{ho}.H $$
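
As a concrete illustration, a forward pass along these lines might be written as below. This is only a sketch of how a predict function for this network could look: it assumes mat_mult performs the matrix product, matrix_map applies a function to every cell (both appear in the training code later), and that sigmoid is the activation for both layers. The real predict in mlp1.c may differ.

// sketch: forward propagation through one hidden layer
matrix predict(nnet nn, matrix inp)
{
	// hidden layer: H = sigmoid(Wih . X)
	matrix hidden_out = mat_mult(nn.wih, inp);
	matrix_map(&hidden_out, sigmoid);

	// output layer: O = sigmoid(Who . H)
	matrix final_out = mat_mult(nn.who, hidden_out);
	matrix_map(&final_out, sigmoid);

	delete_matrix(&hidden_out);   // intermediate result no longer needed
	return final_out;
}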

Back Propagation

The output O depends on the weights Who and on H, and H depends on the weights Wih. As the initial weights are chosen randomly, they are unlikely to be perfect, so the output O differs from the expected (target) value. The difference between target and output is known as the error. This error has to be compensated for by adjusting the weights Who and Wih. The adjustment of weights starts from the last layer and works back towards the beginning, hence the name Back Propagation.

The error has to be distributed among the different layers, depending on how much each layer contributed to it. To find a layer's contribution, the partial derivative of the error with respect to the different contributing elements has to be computed. The full description and derivation is beyond the scope of this article, as it would make it too long; for reference one can watch this video.

We can represent the updated weight matrix of a layer as follows (the per-layer form used by the implementation is written out after the list) $$ W_{new}=W_{cur}+\alpha\,[\mathrm{target}-\mathrm{output}]\cdot X$$ where 

  • X is input to the layer 
  • α is learning rate
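
Written out for the two weight matrices of our network, and matching the calculations in the code below (which also folds in the derivative of the activation function), the updates take the form \begin{align} E_{o} &= T - O \\ E_{h} &= (W^{ho})^{T}E_{o} \\ W^{ho}_{new} &= W^{ho} + \alpha\,[E_{o}\odot\sigma'(O)]\cdot H^{T} \\ W^{ih}_{new} &= W^{ih} + \alpha\,[E_{h}\odot\sigma'(H)]\cdot X^{T} \end{align} where T is the target, X the input, H the activated hidden output, σ' the derivative of the activation and ⊙ element-wise (Hadamard) multiplication.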

 

Implementation

To adjust the weights in Who, we need to find the error at the output, multiply it element-wise with the derivative of the activation function applied to the output, and then multiply by the scalar learning rate (α). That finishes the scalar calculations. Multiplying the result by the transpose of the input to that layer, i.e. the activated output of the hidden layer, gives us the delta weights. Adding these delta weights to the weight matrix Who gives the updated hidden-to-output weights.

The same procedure has to be repeated for the input-to-hidden weights. To begin with, we need the error at the hidden layer, which is found by multiplying (Who)T with the error at the output. The rest of the procedure is the same as above.


	// find error in output
	matrix err_out=subtract_matrix(expect,final_out);
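	// propagate the output error back to the hidden layer: err_hid = (Who)^T . err_out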
	matrix whoT = transpose(nn.who);
	matrix err_hid=mat_mult(whoT,err_out);
	
	// calculate delta errors 
	matrix_map(&final_out,sigmoid_prime);
	mat_mult_hadamard(&err_out,&final_out);
	matrix hidT= transpose(hidden_out);
	matrix delta_who = mat_mult(err_out,hidT);
	mat_mult_scalar(nn.lr,&delta_who);
	
	matrix_map(&hidden_out,sigmoid_prime);
	mat_mult_hadamard(&err_hid,&hidden_out);
	matrix inpT = transpose(inp);
	matrix delta_wih = mat_mult(err_hid,inpT);
	mat_mult_scalar(nn.lr,&delta_wih);
	// update weights
	matrix tempa =add_matrix(nn.who,delta_who);
	assign_matrix(&nn.who,&tempa);
	matrix tempb = add_matrix(nn.wih,delta_wih);
	assign_matrix(&nn.wih,&tempb);

With this we are ready to make a very small neural net and train it on the XOR problem. The MNIST dataset can also be trained, but that needs patience and time (I achieved 91.6% accuracy). The Iris and wheat seed datasets are trained very easily. Below is the XOR implementation.


#include <stdio.h>
#include "../mlp1.c"
int main(int argc, char *argv[])
{
	srand(0);
	nnet my_nn =create_nn(2,3,1);

	// training
	
	for(int n = 0; n< 30000; n++)
	{
		matrix inp = create_matrix(2,1);
		matrix expected = create_matrix(1,1);
		
		set_cell(&inp,0,0,1.0);
		set_cell(&inp,1,0,0.0);
		set_cell(&expected,0,0,1);
		train(my_nn,inp,expected);
		
		set_cell(&inp,0,0,1.0);
		set_cell(&inp,1,0,1.0);
		set_cell(&expected,0,0,0);
		train(my_nn,inp,expected);
		
		set_cell(&inp,0,0,0.0);
		set_cell(&inp,1,0,1.0);
		set_cell(&expected,0,0,1);
		train(my_nn,inp,expected);
		
		set_cell(&inp,0,0,0.0);
		set_cell(&inp,1,0,0.0);
		set_cell(&expected,0,0,0);
		train(my_nn,inp,expected);
		
		delete_matrix(&expected);	
		delete_matrix(&inp);
	}
	// save the trained network
	save_nn(my_nn,"xor2.net");

	// load nn from file
	my_nn = load_nn("xor2.net");
	// test trained network
	matrix inp = create_matrix(2,1);
	set_cell(&inp,0,0,1.0);
	set_cell(&inp,1,0,1.0);
	matrix out = predict(my_nn,inp);
	display_matrix(out);
	delete_matrix(&out);
	set_cell(&inp,0,0,0.0);
	set_cell(&inp,1,0,0.0);
	out = predict(my_nn,inp);
	display_matrix(out);
	delete_matrix(&out);
	set_cell(&inp,0,0,0.0);
	set_cell(&inp,1,0,1.0);
	out=predict(my_nn,inp);
	display_matrix(out);
	delete_matrix(&out);
	set_cell(&inp,0,0,1.0);
	set_cell(&inp,1,0,0.0);
	out = predict(my_nn,inp);
	display_matrix(out);
	
	delete_matrix(&out);
	delete_nn(my_nn);
	delete_matrix(&inp);
	return 0;
}
=======output====
[ashok@fedora xor]$ ./xor
[ 0.00559855 ]

[ 0.06528170 ]

[ 0.96165163 ]

[ 0.96165793 ]

I may put all the code here or on github, but I would like the reader to implement their own version analogous to this one. If you run into any problem, contact me for the full code, it is free; but before asking, please try to implement it yourself.

Till now we discussed

  • What is an artificial neuron
  • How to implement it
  • Which areas it can be used in (linearly separable problems)
  • How to train it
  • What is the XOR problem
  • What is a multi layer neural network
  • How to implement one
  • How to train it
  • An example XOR implementation
This is the end of the three parts on a simple neural network. Part 1 and Part 2 can be found here. In a future article we will discuss Convolutional Neural Networks (CNNs).
 

Sunday 30 April 2023

Neural Network from Scratch Using C (Part-2)

 

Implementing a Perceptron

In this part we will start implementing a single perceptron and do some experiments. 

The first part can be found here.

A single Perceptron

A single perceptron (neuron) can do a lot of jobs. Yes, like a neuron in our nervous system, it can do a lot of things. As one grows from infant to toddler, the neurons in the brain get trained (learn) to coordinate our limbs, coordinate our activities and so on.
Similarly, a perceptron can be trained to classify data according to the training data, but it can only classify linearly separable data.

What is linearly separable?
Data is linearly separable if, when plotted, the data points can be divided by a single straight line. This means that, for the chosen separation criterion, there exists a straight line dividing the data points into 2 regions. For example, for the AND gate the line x1 + x2 = 1.5 separates the point (1,1) from the other three points.

Now let's see a small code implementing it step by step. NOTE: We need a matrix manipulation library for all our neural network adventures. For this purpose my tmatlib on github is quite suitable, though one should not expect it to perform like BLAS. In my code I used a similar version known as smatlib; as it is continuously growing I have not published it on github, but rest assured the functionality and performance of both are the same. 

// structure of our perceptron.

typedef struct perceptron
{
	int n_inp;   // number of inputs
	matrix W;    // weight matrix
	double lr;   // learning rate
	double bias; // bias
}perceptron;

A straight line in a 2-dimensional plane can be expressed as y = mx + c. The bias parameter in our perceptron is related to c. Our training data consists of x and y, where x is the input and y is the expected or target value; the perceptron learns the value of m. The learning rate (lr) determines how fast or slow the perceptron learns. If the learning rate is too low, learning may be very slow or get trapped in a local minimum; if it is too high, it may oscillate and never reach a solution. So the bottom line is that one has to experiment with the learning rate (lr) to achieve a good result. The weight matrix is initialized with random values and is gradually adjusted/updated during training to achieve a better result. 
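
For illustration, a constructor for this struct might look roughly as follows. The function name create_perceptron, the starting bias and the random range are assumptions here, not necessarily what the repository uses.

#include <stdlib.h>   // rand(), RAND_MAX

// hypothetical constructor: random weights in [-1, 1]
perceptron create_perceptron(int n_inp, double lr)
{
	perceptron p;
	p.n_inp = n_inp;
	p.lr    = lr;
	p.bias  = 1.0;                       // arbitrary starting bias
	p.W     = create_matrix(1, n_inp);   // one row of weights
	for (int n = 0; n < n_inp; n++)
		set_cell(&p.W, 0, n, 2.0*rand()/RAND_MAX - 1.0);
	return p;
}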

 I uploaded all the codes at Simple-NeuralNetwork.

In the repo there are 4 files: perceptron.h, linearsep.c, and_gate.c and or_gate.c. We will discuss bits of those files and see how they work.

In the perceptron.h file, there are functions to create, manipulate, train, display and delete the perceptron. The general sequence is as follows (a minimal usage sketch appears after the list):

  • create a perceptron
  • prepare data set
  • train
  • test with predict function
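
Putting those steps together for the AND gate, a minimal sketch could look like the following. It uses the train and predict signatures shown below and the hypothetical create_perceptron from the earlier sketch; the actual demo code in and_gate.c may differ.

#include <stdio.h>
#include "perceptron.h"   // assumed header name from the repo

int main(void)
{
	perceptron p = create_perceptron(2, 0.1);    // 2 inputs, learning rate 0.1

	double data[4][2] = {{0,0},{0,1},{1,0},{1,1}};
	double target[4]  = {0, 0, 0, 1};            // AND truth table

	for (int epoch = 0; epoch < 10000; epoch++)  // present the 4 samples repeatedly
		for (int s = 0; s < 4; s++)
			train(&p, data[s], target[s]);

	for (int s = 0; s < 4; s++)
		printf("inputs %g, %g predicted = %f\n",
		       data[s][0], data[s][1], predict(p, data[s]));
	return 0;
}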

This is simple, isn't it? Now we will look at two functions, namely predict and train. These functions are the core of this perceptron.

predict:

double predict(perceptron p, double inp[])
{
	double res=0.0;
	
	for(int r = 0; r < p.n_inp; r++)
	{
		res += inp[r]*get_cell(p.W,0,r);
	}
	res += p.bias;
	double ret= sigmoid(res);
	return(ret);
} 

This function multiplies the inputs with the corresponding weights and adds them up together with the bias. Next, to activate the output, the sigmoid function is applied; its output ranges from 0 to 1. Please see the first part here, where the weighted sum and the sigmoid function are defined.
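
The sigmoid called above is not shown in this excerpt; one plausible implementation, assuming the standard definition from part 1, is:

#include <math.h>

// sigma(z) = 1 / (1 + e^-z), squashes any real value into (0, 1)
double sigmoid(double z)
{
	return 1.0 / (1.0 + exp(-z));
}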

Train:

  void train(perceptron *p,double *data,double desired)
  {
	double guess = predict(*p,data);
	
	double err = desired - guess;
	for(int n = 0; n < p->n_inp; n++)
	{
		double cur_w = get_cell(p->W,0,n);
		set_cell(&p->W,0,n,cur_w+p->lr*err*data[n]);
	}
	p->bias += p->lr *err;
  } 
 

In this function, the perceptron is trained by adjusting the weights and bias. The steps are as follows.

  • calculate output
  • find error by subtracting output from target
  • find delta weight by multiplying error, data(input) and learning rate
  • find delta bias by multiplying error and learning rate
  • update weights
  • update bias

In step 3 we calculate the delta weights. This is the most important step in the whole code. How and why the delta value is the product of the error and the input is a topic in itself. As a hint, it is a gradient; hence a derivative of the error term with respect to the weights has to be considered, as sketched below.
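
As a rough sketch (treating the output as the plain weighted sum, i.e. ignoring the activation function), define the squared error and differentiate it with respect to one weight: \begin{align} E &= \tfrac{1}{2}(d - y)^{2}, \qquad y = \sum_{n} w_{n}x_{n} + b \\ \frac{\partial E}{\partial w_{n}} &= -(d - y)\,x_{n} \end{align} Stepping against this gradient by the learning rate α recovers exactly the update used in train, and since ∂y/∂b = 1 the bias gets the analogous update: \begin{equation} w_{n} \leftarrow w_{n} + \alpha\,(d-y)\,x_{n}, \qquad b \leftarrow b + \alpha\,(d-y) \end{equation}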

Testing

To try out this implementation, 3 demo files were provided, which adhere to the general sequence. The first one is linear_sep.c. In this demo program, random data sets are created and the perceptron is trained on them; then it is tested against known data. The output on my system is as below.


************Perceptron:**************

number of inputs = 2
Weights:
[ 1.00000000 ]
[ 1.00000000 ]

Bias = 1.000000
*************************************
************Perceptron:**************

number of inputs = 2
Weights:
[ 61.70302487 ]
[ -31.13024636 ]

Bias = 0.622900
*************************************

is (2.000000 < 2 x 20.000000 + 1) predicted = 1.000000

Now for the other 2 demo programs. Can this artificial neuron reliably mimic a digital gate? Well, that was the motive when the artificial neuron was proposed. It turns out the AND and OR gates can be simulated but not the XOR gate. WHY??? From the diagrams below it is obvious that the OR and AND functions are linearly separable but the XOR function is not.


// OR gate implementation.


    // x1 x2 y
    //  0  0 0
    //  0  1 1
    //  1  0 1
    //  1  1 1
    //
    //  0,1-------1,1
    //   |          |
    //  \|          |
    //   |\         |
    //   | \        |
    //  0,0-\-----1,0
    //       \-separating line
    
    // AND gate implementation.
	// x1 x2 y
	//  0  0 0
	//  0  1 0
	//  1  0 0
	//  1  1 1
	//
	//  0,1---\----1,1
	//   |      \    |
	//   |        \  |
	//   |          \|
	//   |           |\-line separating  
	//  0,0-------1,0
// XOR function
// x1  x2  y
//  0   0  0
//  0   1  1
//  1   0  1
//  1   1  0
//        /----------|----- 2 lines separates
//  0,1--/-----1,1   |
//   |  /       |    |
//   | /        |    |
//   |/         |/---|
//   /          /
//  /|         /|
//  0,0-------/-1,0
    

Output of OR gate 

************Perceptron:**************

number of inputs = 2
Weights:
[ 0.00000000 ]
[ 0.00000000 ]

Bias = 1.000000
*************************************
************Perceptron:**************

number of inputs = 2
Weights:
[ 11.48282021 ]
[ 11.48242093 ]

Bias = -5.281243
*************************************
inputs 0, 0 predicted = 0.005060
inputs 1, 0 predicted = 0.997978
inputs 0, 1 predicted = 0.997977
inputs 1, 1 predicted = 1.000000

 

Output of AND gate

************Perceptron:**************

number of inputs = 2
Weights:
[ 0.00000000 ]
[ 0.00000000 ]

Bias = 1.000000
*************************************
************Perceptron:**************

number of inputs = 2
Weights:
[ 10.22930534 ]
[ 10.22993658 ]

Bias = -15.512550
*************************************
inputs 0, 0 predicted = 0.000000
inputs 1, 0 predicted = 0.005050
inputs 0, 1 predicted = 0.005053
inputs 1, 1 predicted = 0.992943

In the above result, if we treat values around 0.99 as 1 and values of the order of 0.005 or less as 0, then our results match the truth table.

As we saw above, XOR is not linearly separable, so we cannot simulate it with a single perceptron. We need more than one layer to simulate it; we will do that in a future post.

Till then happy coding.

Saturday 29 April 2023

Neural Network from Scratch using C

Hello There, 

Around a year ago we last talked. Since then several things have changed. I recently joined as a PhD scholar 😃 and decided to soil my hands in AI/ML. During my M.Tech days (2011-2013) we were taught soft computing, though I did my thesis on Network On Chip (NOC), which is about mapping different IPs onto a silicon substrate. I devised a Genetic Algorithm based mapping as well as a deterministic heuristic based mapping using graphs; both codes are available on my github page. But I never paid much attention to ML until 2016, when I was developing a traffic detection application for a client, though it never went beyond the PoC stage.

Now I thought of putting all my knowledge towards earning a degree (yes, a doctorate). To refresh my understanding from my master's degree days, I decided to go step by step, implementing the building blocks needed to learn this topic.

We should start with a single neuron. Artificial neurons, the counterparts of natural neurons, share similar characteristics.

(Image: Perceptrons - the most basic form of a neural network, from Applied Go)

Dendrites are the inputs and the axon is the output. But one more thing we must add here is activation: it is the strength we provide as output.

In the case of perceptrons, inputs are applied through weights; this means different input channels have different weights. The perceptron sums the weighted inputs and applies some sort of activation before outputting the result.

So mathematically we can write 

output = activation(sum ( inputs x weights)) 

output = activation(sum ( input array * weights array))

We can say we need 2 one-dimensional matrices, one for the inputs and one for the weights. Then, on the result of their multiplication, we apply an activation function.

Let's assume we have 3 inputs i1, i2, i3 with weights w1, w2 and w3. The weighted sum will be \begin{equation} z=\sum_{m=1}^{n} i_{m}w_{m} \end{equation} with n = 3 here, and the activation of z is the output. 

There are several activation functions used to activate the output of a perceptron. The most used are the binary step, ReLU (max(0, x)) and the sigmoid (range 0 to 1). 

We will use sigmoid for our application. \begin{equation}\sigma(z)=\frac{1}{1+e^{-z}}\end{equation}

 Next we will see the simple implementation of this perceptron.


Bye Bye for now.
