Feedforward neural networks

Recapitulation of supervised learning systems studied thus far

We have studied how linear regression can help us predict continuous values via a linear relationship with a set of features.
We have also studied how we can use logistic regression and softmax to classify datapoints into different discrete categories.
The equations for these systems are:
- Linear regression $\hat{y}_i = x_iW + b \in \mathbb{R}^{1 \times 1}$
- Logistic regression $\hat{y}_i = \sigma(x_iW + b) = \frac{1}{1 + e^{-(x_iW + b)}} \in \mathbb{R}^{1 \times 1}$
- Softmax $\hat{y}_i = \mathrm{softmax}(x_iW + b) \in \mathbb{R}^{1 \times C}$, with entries $[\hat{y}_i]_c = \frac{e^{[x_iW + b]_c}}{\sum_j e^{[x_iW + b]_j}}$
Notice that in each case we have
- used $W$ and $b$ to transform the input data $x_i$.
- passed the result of this transformation through a function (the identity, in the case of linear regression) to get the output $\hat{y}_i$.
However, we can apply more transformations to the input data before reaching the final output.
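
To make the three equations above concrete, here is a minimal sketch in plain Python. The toy input, the weight values, and the helper names (`affine`, `sigmoid`, `softmax`) are made up for illustration; in practice one would normally use a numerical library rather than nested lists.

```python
import math

def affine(x, W, b):
    """Return the row vector x W + b, for x of length D, W of shape D x K, b of length K."""
    return [sum(x[d] * W[d][k] for d in range(len(x))) + b[k] for k in range(len(b))]

def sigmoid(z):
    """Logistic function applied to a scalar."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    """Softmax over a list of scores; subtracting the max only improves numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input with D = 2 features.
x_i = [1.0, 2.0]

# Linear regression: W has shape D x 1, so the output is a single real number.
W_lin, b_lin = [[0.5], [-0.25]], [0.1]
y_lin = affine(x_i, W_lin, b_lin)[0]

# Logistic regression: the same transformation, squashed into (0, 1).
y_log = sigmoid(affine(x_i, W_lin, b_lin)[0])

# Softmax: W has shape D x C (here C = 3), so the output is a distribution over C classes.
W_soft = [[0.2, -0.1, 0.0], [0.0, 0.3, -0.2]]
b_soft = [0.0, 0.0, 0.0]
y_soft = softmax(affine(x_i, W_soft, b_soft))

print(y_lin, y_log, y_soft)
```

All three models share the same `affine` step; they differ only in the function applied afterwards.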

Instead of computing $\hat{y}_i$ directly, we can use input data $x_i$ to compute a “hidden state” $h_i = f(x_iW + b) \in \mathbb{R}^{1 \times H}$, where $H$ is the number of hidden units.
The function $f()$ that gives the hidden state must be non-linear but is otherwise arbitrary (although some non-linearities are more commonly used).
If we used this hidden state with the systems we have studied so far in this class, their equations would be as follows (here $W$ and $b$ denote a second set of weights and biases, separate from those used to compute $h_i$):
- Linear regression neural network $\hat{y}_i = h_iW + b \in \mathbb{R}^{1 \times 1}$
- Logistic regression neural network $\hat{y}_i = \sigma(h_iW + b) = \frac{1}{1 + e^{-(h_iW + b)}} \in \mathbb{R}^{1 \times 1}$
- Softmax neural network $\hat{y}_i = \mathrm{softmax}(h_iW + b) \in \mathbb{R}^{1 \times C}$, with entries $[\hat{y}_i]_c = \frac{e^{[h_iW + b]_c}}{\sum_j e^{[h_iW + b]_j}}$
These equations represent three different neural networks, each of which could be used for different tasks and applications.
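
As a rough illustration of the softmax neural network, the sketch below computes a hidden state and then the output in plain Python. It assumes ReLU for $f$, and the names `W1`, `b1`, `W2`, `b2` and all numeric values are hypothetical, chosen only to make the two separate sets of parameters explicit.

```python
import math

def affine(x, W, b):
    """Return the row vector x W + b, for x of length D, W of shape D x K, b of length K."""
    return [sum(x[d] * W[d][k] for d in range(len(x))) + b[k] for k in range(len(b))]

def relu(z):
    """Element-wise ReLU, used here as the non-linearity f (an assumption)."""
    return [max(0.0, v) for v in z]

def softmax(z):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input with D = 2 features.
x_i = [1.0, -2.0]

# Hidden layer parameters: W1 has shape D x H (here H = 3).
W1 = [[1.0, -1.0, 0.5],
      [0.5, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]

# Output layer parameters: W2 has shape H x C (here C = 2).
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [0.4, -0.4]]
b2 = [0.0, 0.0]

h_i = relu(affine(x_i, W1, b1))        # hidden state, length H
y_hat = softmax(affine(h_i, W2, b2))   # class probabilities, length C

print(h_i, y_hat)
```

Replacing `softmax` with a sigmoid on a single output column, or with no function at all, would give the logistic and linear regression variants.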

Today, the most commonly used non-linearity is ReLU (Rectified Linear Unit) or one of its variants.
Other useful non-linearities include sigmoid and the hyperbolic tangent (tanh).
Advantages of ReLU over other non-linearities include:
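- it does not “saturate” for positive inputs, so its gradient does not shrink toward zero there
- it is cheap to compute: it is a simple “thresholding” operation, $\max(0, z)$, and involves no exponentiation
- it tends to make the network learn a sparse representation, since negative pre-activations are set to exactly zero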
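
The three points above can be checked numerically with a small sketch. The function names and the sample inputs are our own choices for illustration.

```python
import math

def relu(z):
    """ReLU is a threshold at zero; no exponential is needed."""
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu_grad(z):
    """Derivative of ReLU (taking the value 0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

# Saturation: for large |z| the sigmoid and tanh gradients approach zero,
# while the ReLU gradient stays at 1 for any positive input.
for z in (-10.0, -1.0, 0.5, 10.0):
    print(z, relu_grad(z), sigmoid_grad(z), tanh_grad(z))

# Sparsity: every negative pre-activation is mapped to exactly zero.
pre_activations = [-2.0, 0.3, -0.7, 1.5]
hidden = [relu(z) for z in pre_activations]
print(hidden)
```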

In the previous sections we talked about a neural network with one hidden layer.
However, we can have more hidden layers, each one taking the previous layer's hidden state as its input. In fact, we can have as many as necessary (how do you know how many are necessary?).
The resulting computational architecture is known as a feedforward neural network.
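
Stacking hidden layers amounts to repeating the same step in a loop. The sketch below is one possible way to write this, assuming ReLU in every hidden layer and a linear (regression-style) output; the function `feedforward` and all parameter values are hypothetical.

```python
def affine(x, W, b):
    """Return the row vector x W + b, for x of length D, W of shape D x K, b of length K."""
    return [sum(x[d] * W[d][k] for d in range(len(x))) + b[k] for k in range(len(b))]

def relu(z):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in z]

def feedforward(x, hidden_layers, output_layer):
    """Pass x through each hidden layer in order, then through a linear output layer.

    hidden_layers is a list of (W, b) pairs; output_layer is a single (W, b) pair.
    """
    h = x
    for W, b in hidden_layers:
        h = relu(affine(h, W, b))  # each layer consumes the previous hidden state
    W_out, b_out = output_layer
    return affine(h, W_out, b_out)

# Hypothetical network: 2 inputs, two hidden layers of 2 units each, 1 output.
hidden_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [1.0, -1.0]], [0.0, 0.0]),
]
output_layer = ([[2.0], [5.0]], [1.0])

y_hat = feedforward([1.0, 2.0], hidden_layers, output_layer)
print(y_hat)
```

The number of hidden layers is simply the length of `hidden_layers`, so changing the depth does not require changing the forward-pass code.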