Neural Networks — The Underlying Technology of Artificial Intelligence

Sohum Padhye
12 min read · Nov 1, 2022

Everything from the fundamentals of tensors and perceptrons, to the architecture of neural networks.

Artificial Intelligence (AI) is what your phone uses when it unlocks just by looking at your face. It matches the features of the face in front of the camera against yours to determine that it’s really you holding the phone. It’s also what many companies use to work out which products you’re likely to enjoy. Amazon and Netflix do this all the time: they take what you currently like and suggest similar things.

When I first looked into AI, I was fascinated because there were endless possibilities for what you could build with it. Want to make a bot that can aim and shoot at targets in shooter games for you? Done. Want to develop a model that can predict which customers are most likely to buy your product? Done. Want to create an agent that can teach itself new strategies to complete a task more efficiently? You can do that too.

However, to create one of these it’s not like you can just slap “AI” on a code block. You have to create something called a neural network. Essentially, a neural network is something that can derive an output from an input, then look at how far it was from the ideal solution and learn from its mistakes.

To learn more about this technology, I took this course from Udacity, which covers the intuition behind neural networks and how to build one with PyTorch. The first application of neural networks that I learned about was classification, and that’s what I’m going to be covering in this article.

Overview:

  • Classification
  • Perceptrons
  • Weights and Biases
  • Activation Functions
  • Feed-forward
  • Backpropagation
  • Error Functions
  • Gradient Descent
  • Overfitting

Classification

In the case of artificial intelligence, classification can be defined as assigning each item in a dataset to one of two or more labels. For example, I have 6 items that I want to classify into two categories: cars and everything but cars. The classification would look like this:

For us humans, it seems pretty intuitive. We can easily distinguish the cars from the not-cars, but how would something like a neural network do this? It’s not like a neural network instinctively knows which images are cars and which ones aren’t. So, the question arises — how would we train a neural network to identify images of cars and tell when a given image is not a car?

Let’s take another example. Let’s say that a racing league will only admit racing teams who get a lap time, for both of their drivers, below a certain value. That would look something like this:

The x- and y-values of each point in the graph show the lap times of Driver 1 and Driver 2, respectively. All teams in the blue area are more likely to get accepted, and all teams in the red area are more likely to get rejected. The closer a point is to (0, 0), the more likely that team is to be accepted into the racing league. In this case, the x-axis is denoted as x1, and the y-axis is denoted as x2. Y is the probability that the point’s label is 1, that is, the probability that the point falls inside the blue region (more on this in the rest of the article).

However, there’s a problem with determining who will be in the league and who won’t. Take a look at these two points.

Even though one driver on their team performed terribly, these teams are still being accepted into the racing league. So, to make better decisions on which teams should be in the league or not, we should adjust the boundary a bit so that it only accepts teams where both of their drivers performed well. One way we can do that is by creating a curve.

Now, only teams to the left of and under the curve will be accepted. Note that the equation of a curve is more complicated than the equation of a line.

This is one example of how we can use neural networks to create a model that can classify data points and use it to predict the label of a new data point.

However, the racing league likely isn’t going to be just using the driver’s lap times. They also might be adding the performance of the cars into the equation. So, now it might look something like this:

The x- and y-axes are the drivers’ lap times, and the z-axis is the car's performance (performance rating). All points above the boundary are teams that make it into the racing league. The z-axis is denoted as x3.

Here, instead of plotting our points in a two-dimensional space, we’re now doing it in a three-dimensional space. If we were to keep adding more independent variables, we would be plotting the points in an n-dimensional space, and the boundary would always have n-1 dimensions (a line in 2D, a plane in 3D, and so on).

Perceptrons

A neural network has several parts. It has nodes, which store data and apply functions to it, then pass the result on to the next layer through connections called weights. The value of a weight determines how much its corresponding data contributes to the overall output of the neural network. Each node also has a bias, which shifts the input to the activation function up or down. Finally, the neural network has three sections: the input layer, hidden layers, and output layer.

  • Input layer: the input layer stores data from what is given to it. Each of the nodes then applies functions to the data and sends it to the next layer.
  • Hidden layers: in this example, there is only one hidden layer, but there can be as many as the creator of the neural network wants. The hidden layers are where most of the calculations occur. Their nodes apply activation functions to the data they receive and send the results through weights to the next hidden layer or, after the last one, to the output layer.
  • Output layer: the output layer doesn’t have to consist of just one node. For example, if you are given a handwritten digit and you want to find the probability that it is each digit from 0–9, you can include 10 output nodes. In the output layer, an activation function like the sigmoid or threshold is used (more on this later in the article) to give a final output.

In this section, I’m going to go into detail about what a perceptron (or node) is and its role in the neural network.

When you think about logic gates, you think about how each gate either takes in one or two inputs, then sends an output along. For example, take an AND gate. It only returns 1 if both of its inputs are 1, otherwise, it returns a 0. An OR gate returns a 1 if at least one of its inputs is 1, otherwise, it returns a 0. A NOT gate returns the opposite of its input. If it gets a 1, it returns a 0, and vice-versa.

Perceptrons can be thought of in the same way. They receive inputs, do something with what they are given, and pass it along.

This is an example of how a perceptron would take inputs, apply an activation function, and pass along an output. In the ReLU activation function specifically, if the input is negative, it outputs 0. If the input is positive, it returns the input.

Normally in neural networks, we use the ReLU activation function a lot because it is very cheap to compute: it only has to compare its input with 0, so it adds little to the computing power the network needs.

Each of the nodes in a neural network is a perceptron. Each one takes in the outputs of the perceptrons in the previous layer, then generates its own output to pass to the next layer. Only the input layer doesn’t take its inputs from perceptrons; that information is simply given to it. The output layer also passes its inputs through an activation function, usually the sigmoid or threshold function.
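To make the logic-gate comparison concrete, here is a minimal sketch of a perceptron in plain Python. It uses a threshold (step) activation, and the weights and biases are values I picked by hand for illustration, not values a network learned. The function names are my own.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)


# Hand-picked (not learned) weights and biases that reproduce each gate.
def and_gate(a, b):
    return perceptron([a, b], [1.0, 1.0], -1.5)


def or_gate(a, b):
    return perceptron([a, b], [1.0, 1.0], -0.5)


def not_gate(a):
    return perceptron([a], [-1.0], 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", and_gate(a, b), "OR:", or_gate(a, b))
print("NOT 0:", not_gate(0), "NOT 1:", not_gate(1))
```

The same `perceptron` function behaves like three different gates; only the weights and bias change. That is the part a neural network adjusts when it learns.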

Weights and Biases

When looking at the last image, you were probably wondering what w1 and w2 were. These are called weights and they play a crucial role in deciding which parts of information are more important than others.

There is also another part to working out the output from the input, and that is the bias. Although it wasn’t shown in the perceptron diagram, it is added to the rest of the equation.

Let’s look at the equation of a line.

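Here, the line is w1·x1 + w2·x2 + b = 0, where x1 and x2 are the inputs, w1 and w2 are the corresponding weights, and b is the bias. Changing the bias shifts the line without changing its slope. In this example, if you increase the bias, the line shifts to the right, and if you decrease the bias, the line shifts to the left.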

What’s happening here is the neural network has a line (we’ll use a line for simplicity’s sake), and it wants to shift and rotate the line so that it has as many blue points as it can in the positive region (left of the line) and as many red points as it can in the negative region (right of the line). Adjusting the weights and bias like this is called training, and the method that works out how each weight should change is called backpropagation.
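Here is a small sketch of how the weights and bias decide which side of the line a point falls on, using the line w1·x1 + w2·x2 + b = 0. The lap times, weights, and biases are made-up numbers for illustration only. I chose negative weights so that lower lap times land in the positive (accepted) region.

```python
def score(x1, x2, w1, w2, b):
    """Left-hand side of the line equation w1*x1 + w2*x2 + b."""
    return w1 * x1 + w2 * x2 + b


def predict(x1, x2, w1, w2, b):
    """Label 1 (accepted) on the positive side of the line, 0 otherwise."""
    return 1 if score(x1, x2, w1, w2, b) >= 0 else 0


# Hypothetical values: lap times in seconds, weights of -1 for both drivers.
w1, w2 = -1.0, -1.0
team = (85.0, 92.0)  # Driver 1 and Driver 2 lap times

print(predict(*team, w1, w2, b=180.0))  # 1: the team is inside the boundary
print(predict(*team, w1, w2, b=170.0))  # 0: a smaller bias moved the line past the team
```

With these weights the boundary is x1 + x2 = b, so lowering the bias from 180 to 170 slides the line toward (0, 0) and the same team ends up on the rejected side.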

Activation Functions

In this section, I’m going to quickly go over some of the most common activation functions that we use in neural networks.

ReLU

I already went over the ReLU activation function, but it essentially returns an output of 0 if the weighted sum of its inputs is 0 or less. If the weighted sum of its inputs is positive, then it returns that sum unchanged.
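In code, ReLU is a one-liner. This is a plain-Python sketch rather than the PyTorch version:

```python
def relu(z):
    """Return z if it is positive, otherwise 0."""
    return max(0.0, z)


print([relu(z) for z in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 3.5]
```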

Sigmoid

The sigmoid function is good for returning a probability that something is going to happen. Its output is always between 0 and 1, where values near 1 mean almost certain and values near 0 mean almost certainly not.
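A minimal sketch of the sigmoid function, 1 / (1 + e^(-z)), in plain Python. The two branches compute the same value; I split them only so that `math.exp` never receives a large positive number, which would overflow.

```python
import math


def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z is negative, so e is between 0 and 1
    return e / (1.0 + e)


print(sigmoid(0.0))   # 0.5: right on the boundary
print(sigmoid(4.0))   # close to 1
print(sigmoid(-4.0))  # close to 0
```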

Softmax

Softmax is good for determining the probability that something belongs to each of several classes. It turns a list of scores into probabilities that add up to 1.

Here is an image (left) from the Fashion MNIST dataset. As you can see on the right, I used the softmax function, which gave me the probability of the image on the left being in each class. It mainly thinks that the picture is a dress, with a prediction of about 90%, and then it says there is about a 5% chance that it is a t-shirt. The rest of the classes’ probabilities are negligible.
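Here is a rough sketch of softmax in plain Python. The three scores are invented for illustration; they are not the values my Fashion MNIST model produced. Subtracting the largest score before exponentiating doesn’t change the result and just keeps the numbers small.

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes, e.g. dress, t-shirt, sneaker.
probabilities = softmax([4.0, 1.0, -2.0])
print([round(p, 3) for p in probabilities])
```

The class with the highest score always gets the highest probability, and a bigger gap between scores makes the network look more confident.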

Feed-forward

Feedforward is just how a neural network derives an output from its input. Although intuitively, it seems simple, the mathematics behind it is pretty complicated.

Here, I’m making a neural network diagram that includes the bias as a node/perceptron. We have two features, x_1 and x_2, and then we also include the bias as a value of 1 in the input layer. We do this so that we only have to adjust the weights to adjust the bias because the value in the node gets multiplied by the weight (no values lead into the bias). When we multiply the weight by one, it’s the same value. We then add our bias again in the hidden layer, and in the hidden layer, we’re also using sigmoid functions to get values between 0 and 1. In the end, we use one last sigmoid function to get a final value between 0 and 1.

The mathematical way of getting the final sigmoid prediction (y-hat) is as follows:

It looks complicated, so let me break it down for you.

Reading the equation from the inside out: first, the features (x_1, x_2, 1) are multiplied by the matrix of first-layer weights, W^(1), and the sigmoid function is applied to the result. That gives the values of the hidden layer. Then those values are multiplied by the second set of weights, W^(2), and the sigmoid function is applied once more to give the prediction. This second layer of weights is what gives us a nonlinear model, because it combines the outputs of the first layer.

This part is hard to understand, and I’m still trying to understand all of it. So, don’t feel discouraged if you don’t get it right away.
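It helped me to write the feed-forward step out as code. This is a sketch under my own assumptions: two features, two hidden nodes, one output, and the bias treated as an extra input fixed at 1, as described above. The weights are arbitrary numbers, and the layer sizes may not match the diagram exactly.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, inputs):
    """Each row of weights belongs to one node: weighted sum, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def feed_forward(x1, x2, W1, W2):
    inputs = [x1, x2, 1.0]              # the trailing 1 is the bias input
    hidden = layer(W1, inputs) + [1.0]  # add a bias input for the next layer
    return layer(W2, hidden)[0]         # single output node: y-hat


# Arbitrary example weights. The last column of each matrix acts as the bias.
W1 = [[0.5, -0.4, 0.1],
      [-0.3, 0.8, -0.2]]
W2 = [[1.2, -0.7, 0.3]]

y_hat = feed_forward(0.4, 0.6, W1, W2)
print(round(y_hat, 3))
```

In short, y-hat is sigmoid(W^(2) · sigmoid(W^(1) · x)): one matrix multiplication and one sigmoid per layer.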

Backpropagation

We’ve looked at how a neural network derives an output from its inputs, and we’ve looked at the equation Wx + b = 0. Now, let’s look at how a neural network takes its output, sees how far it is from the ideal solution, and then adjusts its weights to get closer to the ideal solution.

Error Functions

If you’re in a car and you want to get to a certain place, you can call the distance between your car and your destination the error. It tells you that this is how far you are from arriving at your destination.

Unfortunately, you don’t have something that can plan the fastest route. So, to get to your destination, you look at all the possible directions in which you can drive and take the one that brings you closer to your destination the fastest. Then, you keep repeating that process until you arrive. When the thing we shrink step by step is an error like this, the process is called gradient descent.

Log-loss

The blue area is what the neural network thinks should be blue points, and the red area is what the neural network thinks should be red points. This neural network isn’t doing very well because it misclassifies two points.

To reward correctly classified points and penalize incorrectly classified ones, the log-loss error function assigns a very small penalty to points that are classified correctly and a large penalty to points that aren’t. The penalty for a point is the negative logarithm of the probability the network gave to that point’s true label. So the penalty is almost 0 when the point is confidently classified correctly, and it grows quickly the further the point sits on the wrong side of the boundary.

The total error is just the sum of the error of each point.
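As a sketch, log-loss for a single point with true label y (0 or 1) and predicted probability y-hat is -(y·ln(y-hat) + (1 - y)·ln(1 - y-hat)). The labels and predictions below are made up for illustration, and the function assumes y-hat is strictly between 0 and 1.

```python
import math


def log_loss(y, y_hat):
    """Penalty for one point: small when y_hat is close to the true label y."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def total_error(labels, predictions):
    """Sum of the penalties of every point."""
    return sum(log_loss(y, p) for y, p in zip(labels, predictions))


# Hypothetical blue (1) and red (0) points with the network's predictions.
labels = [1, 1, 0, 0]
predictions = [0.9, 0.1, 0.2, 0.8]  # the 2nd and 4th points are misclassified

for y, p in zip(labels, predictions):
    print(y, p, round(log_loss(y, p), 3))
print("total:", round(total_error(labels, predictions), 3))
```

The two misclassified points contribute most of the total, which is exactly the behavior we want from an error function.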

The next step is to make small changes and see if the total error decreases. If we go back to our car example, once we move towards the destination, our error decreases, so we know we’re on the right track. In the case of neural networks, this process is called gradient descent. We are making slight adjustments in the weights and bias to arrive at a different spot.

Gradient Descent

To decrease the error function the most, we take the partial derivative of the error with respect to each weight (w_1, w_2, and so on) and the bias, and collect them into a vector called the gradient. The gradient points in the direction that increases the error the most, so we step in the exact opposite direction to decrease the error as much as we can. We keep repeating this process until the error stops getting smaller.

Y-hat here is the output of the neural network, and nabla E (nabla is the upside-down delta sign) is the gradient of the error function E. It holds the derivatives of the error with respect to the weights and bias, and to find the fastest way to go downward, we multiply it by -1.

While doing this seems like a good idea, there’s one mistake that we’re making. When we make these steps, it’s usually not a good idea to take large steps. So, we introduce something called the learning rate. Then, to update the weights, we take the current weights and subtract (because we want to move against the gradient) the learning rate multiplied by the partial derivative of the error with respect to w_i.
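Here is a minimal sketch of that update rule for a single sigmoid perceptron trained with log-loss. For this particular combination, the partial derivative of the error with respect to w_i works out to (y-hat - y)·x_i, and with respect to the bias to (y-hat - y). The data points, learning rate, and number of epochs are values I made up, and the loop updates the weights after every point rather than after the whole dataset.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(point, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, point)) + bias)


def total_error(points, labels, weights, bias):
    """Sum of the log-loss of every point."""
    err = 0.0
    for point, y in zip(points, labels):
        y_hat = predict(point, weights, bias)
        err -= y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return err


def train(points, labels, learning_rate=0.1, epochs=200):
    weights, bias = [0.0] * len(points[0]), 0.0
    for _ in range(epochs):
        for point, y in zip(points, labels):
            y_hat = predict(point, weights, bias)
            # Step against the gradient: w_i <- w_i - learning_rate * dE/dw_i
            weights = [w - learning_rate * (y_hat - y) * x
                       for w, x in zip(weights, point)]
            bias -= learning_rate * (y_hat - y)
    return weights, bias


# Hypothetical data: label 1 for points near (0, 0), label 0 for points far away.
points = [(0.2, 0.3), (0.3, 0.1), (0.8, 0.9), (0.9, 0.7)]
labels = [1, 1, 0, 0]

before = total_error(points, labels, [0.0, 0.0], 0.0)
weights, bias = train(points, labels)
after = total_error(points, labels, weights, bias)
print("error before:", round(before, 3), "error after:", round(after, 3))
```

A smaller learning rate makes each step more cautious, so the error falls more slowly but is less likely to overshoot the lowest point.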

And that is how neural networks learn! I hope that now you have a basic understanding of how neural networks work. Now, I’m going to go over a case that happens very often when trying to train a neural network.

Overfitting

Overfitting is when the model becomes too used to the training data and performs poorly on test data. What do I mean by this?

When we are training our neural network, we want to give it different data than what we will be using to test it. If we give it the same testing data as training data, it will just memorize the answers, and we don’t want that. We want the neural network to be able to figure the answers out.
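One simple way to keep training and testing data separate is to shuffle the dataset once and set a slice of it aside. This is a plain-Python sketch; the function name, the 20% test fraction, and the seed are my own choices.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then hold back a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_data, test_data = train_test_split(range(10))
print(len(train_data), "training items,", len(test_data), "test items")
```

The network only ever sees `train_data` while learning, so its score on `test_data` shows how well it handles points it has not memorized.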

The problem is, however, that when we let it train itself repeatedly on training data, it starts to memorize the training data, then starts forming complex boundaries that only fit the training data. Now, while this means that it will perform exceptionally on training data, you can probably guess how bad it’s going to perform when it handles data it has never seen before.

Here, it is forming boundaries too tightly around the points. When a new point is introduced, it doesn’t even know how to classify it because it’s outside the boundaries.

Here, you can see that as the number of epochs (runs through the dataset) increases, the training loss (error) decreases. However, as the neural network gets more used to the training data, its loss on the validation (test) data starts to increase.

Dropout

Dropout is a very effective way to combat the problem of overfitting. What happens during overfitting is that the network starts to rely on certain weights more than others. Normally this is okay, but when overfitting starts to happen, it gets to a point where it’s doing more harm than good to the overall performance of the neural network.

Dropout is a method to help the neural network “get used to” using all of its weights, and it does this by kicking a few nodes out during the training.

Essentially, each node in the hidden layer(s) is assigned a probability that it will be excluded on a given training pass. This makes it so that if a node that the neural network heavily relies on gets dropped out, the network has to find new ways to get the answer and use the weights on the other nodes more.
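Here is a rough sketch of what dropout does to the outputs of a hidden layer during training. Scaling the surviving values up by 1 / (1 - drop probability) is an extra detail not covered above; I am assuming it here because it keeps the average output about the same with and without dropout. The activations and the drop probability are made up, and the function expects a drop probability below 1.

```python
import random


def dropout(activations, drop_probability, rng):
    """Zero each activation with the given probability (training only)."""
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [0.8, 0.1, 0.5, 0.9, 0.3]
print(dropout(hidden, 0.5, rng))
print(dropout(hidden, 0.5, rng))  # a different set of nodes is dropped each time
```

When the network is used for testing or real predictions, dropout is switched off and every node takes part.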

Here, you can see that after using dropout the validation (test) loss is about the same as the training loss, which is what we want to see.

Final Thoughts

A solid foundation of understanding neural networks is really beneficial when you get into more complex things like computer vision, GANs, and more. I strongly believe that artificial intelligence has the power to change so many aspects of our lives. You can create a neural network to learn to do pretty much anything you can think of, and that creates so many new opportunities for our future, allowing us to grow exponentially. Soon, we may even have artificial general intelligence, where machines can learn at the same rate as us. Who knows what the future holds for us and our research in artificial intelligence?

If you made it this far through the article, thank you for reading it all the way through. However, there still might be some parts that I could have explained better. If you have any questions, feel free to leave them in the comments and I’ll do my best to answer them. Also, let me know in general how much you liked this article! Give me feedback! I want to keep improving these so you can get more value out of reading them.

Now that I have a better understanding of how neural networks actually work, I want to look more into reinforcement learning because that was a topic that really piqued my interest. Seeing how a neural network can go from making random moves to understanding core game mechanics is just fascinating to me, so that’s the application of neural networks that I will explore next.

Again, thank you for reading all the way through. I appreciate it a lot. Stay tuned for my next article!
