In the first of our series, we discuss how a single-layer neural network works and how to implement it in Python. We don’t focus on performance; instead, we write the program to reflect how we think about the algorithm, pray it runs fast enough, and optimize later if needed.

Why is it called a “neural network”?

Neural nets are composed of layers of units, or neurons. The idea is that the way in which the units are connected to each other resembles neurons in the brain.

Schematic of a single-layer neural net, from James Loy’s excellent article on how to build a neural net in Python.

Neural networks consist of an input layer, one or more hidden layers, and an output layer. We feed an observation into the input layer, computations on the data occur in the hidden layers, and an output is produced at the output layer.

One common goal for a neural network is prediction. When we train on data, we typically want to minimize some loss function that tells us how different the model’s predictions are from the data. This optimization can be done with gradient descent on the full training set (batch gradient descent) or with stochastic gradient descent, which updates the weights one observation at a time. Here, we will use stochastic gradient descent.

The math behind it all

Consider training data $X$ which has $p$ features, $X=(X_1,X_2,\dots,X_p)$. Given $N$ hidden nodes, we begin by forming, for each node, a linear combination $z_n$ of the features of $X$ using feature weights $w_{j,n}$:

$$ z_n=w_{0,n}+\sum_{j=1}^p w_{j,n}X_j $$

with $n=1,\dots,N$. The feature bias is denoted $w_{0,n}$. We then apply an activation function $g$, which transforms each linear combination nonlinearly; we write $A_n=g(z_n)$ for the resulting activations. Finally, we form another linear combination, this time of the activations, given by

$$ f(X)=\beta_0+\sum_{n=1}^N\beta_n\left[g(z_n) \right]. $$

The output weights are denoted $\beta_n$, and the output bias is denoted $\beta_0$. Note that this process is done for every observation in the training data.
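To make this concrete, here is a minimal NumPy sketch of the forward pass for a single observation. The sigmoid activation and the variable names (`W`, `b`, `beta`, `beta0`) are illustrative choices, not something fixed by the formulas above.

```python
import numpy as np

def sigmoid(z):
    """One common choice for the activation function g."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b, beta, beta0):
    """Forward pass for one observation x with p features.

    W     : (p, N) feature weights w_{j,n}
    b     : (N,)   feature biases  w_{0,n}
    beta  : (N,)   output weights  beta_n
    beta0 : scalar output bias     beta_0
    """
    z = b + x @ W          # z_n = w_{0,n} + sum_j w_{j,n} X_j
    A = sigmoid(z)         # A_n = g(z_n)
    f = beta0 + A @ beta   # f(X) = beta_0 + sum_n beta_n A_n
    return z, A, f
```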

Optimization

In order for the neural net to “learn”, we must update the feature weights ($w$) and output weights ($\beta$) so that they minimize the difference between the model’s predictions and the data. We use the squared-error loss

$$ L=(y-f(X))^2 $$

where $y$ denotes the observed output for the given observation. Of course, to do gradient descent, we need the gradient. Using the chain rule,

$$ \begin{align*}\frac{\partial L}{\partial\beta_n} &= \frac{\partial L}{\partial f}\frac{\partial f}{\partial\beta_n} = -2(y-f)A_n = -2(y-f)g(z_n)\\[20pt] \frac{\partial L}{\partial\beta_0} &= \frac{\partial L}{\partial f}\frac{\partial f}{\partial\beta_0} = -2(y-f)\\[20pt] \frac{\partial L}{\partial w_{j,n}} &= \frac{\partial L}{\partial f}\frac{\partial f}{\partial A_n}\frac{\partial A_n}{\partial z_n}\frac{\partial z_n}{\partial w_{j,n}} = -2(y-f)\beta_n g'(z_n)X_j\\[20pt] \frac{\partial L}{\partial w_{0,n}} &= \frac{\partial L}{\partial f}\frac{\partial f}{\partial A_n}\frac{\partial A_n}{\partial z_n}\frac{\partial z_n}{\partial w_{0,n}} = -2(y-f)\beta_n g'(z_n)\end{align*} $$
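Translating these partial derivatives directly into code, a gradient routine might look like the sketch below (again assuming a sigmoid activation, for which $g'(z)=g(z)(1-g(z))$; the names mirror the forward-pass sketch above and are illustrative, not prescribed).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(x, y, W, b, beta, beta0):
    """Partial derivatives of L = (y - f(X))^2 for a single observation."""
    z = b + x @ W                 # z_n
    A = sigmoid(z)                # A_n = g(z_n)
    f = beta0 + A @ beta          # model output f(X)
    dg = A * (1.0 - A)            # g'(z_n) for the sigmoid
    err = -2.0 * (y - f)          # factor shared by every derivative

    d_beta  = err * A                       # dL/d(beta_n)
    d_beta0 = err                           # dL/d(beta_0)
    d_W     = err * np.outer(x, beta * dg)  # dL/d(w_{j,n}), shape (p, N)
    d_b     = err * beta * dg               # dL/d(w_{0,n})
    return d_W, d_b, d_beta, d_beta0
```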

Choosing an appropriate learning rate $\alpha$, we update the feature and output weights at each iteration:

$$ \begin{align*}\beta_n &\xleftarrow{} \beta_n - \alpha\,\frac{\partial L}{\partial\beta_n},\\ \beta_0 &\xleftarrow{} \beta_0 - \alpha\,\frac{\partial L}{\partial\beta_0},\\ w_{j,n} &\xleftarrow{} w_{j,n} - \alpha\,\frac{\partial L}{\partial w_{j,n}},\\ w_{0,n} &\xleftarrow{} w_{0,n} - \alpha\,\frac{\partial L}{\partial w_{0,n}}. \end{align*} $$
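A single stochastic gradient descent step then subtracts the scaled gradients from the current parameters. As a sketch, using the hypothetical `gradients` function from above:

```python
def sgd_step(x, y, W, b, beta, beta0, alpha=0.01):
    """One stochastic gradient descent update for a single observation."""
    d_W, d_b, d_beta, d_beta0 = gradients(x, y, W, b, beta, beta0)
    W     = W - alpha * d_W
    b     = b - alpha * d_b
    beta  = beta - alpha * d_beta
    beta0 = beta0 - alpha * d_beta0
    return W, b, beta, beta0
```

Looping this step over the observations, and repeating for several passes over the data, is what we mean by training with stochastic gradient descent.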