Logistic regression, binary cross-entropy, and error metrics
Logistic regression
- Last class we talked about how we can use linear regression to model linear relationships between features $X$ and a target $y$.
- But not all relationships between features and targets are linear. Sometimes the relationship is categorical (e.g., different instruments or different musical genres).
- Now picture this scenario:
  - You have $N$ datapoints, each with $D$ features, and you have organized them in a matrix $X \in \mathbb{R}^{N \times D}$.
  - Half of these datapoints have features extracted from Violin tones, while the other half were extracted from Bass Tuba tones.
  - You also have a vector $y \in \{0, 1\}^N$, which is filled with zeros and ones, with each "zero" indicating that the features in the corresponding row of $X$ were extracted from a Violin tone, and each "one" indicating a Bass Tuba tone.
- You can use the logistic regression formula to find a weight vector $w \in \mathbb{R}^D$ and a bias term $b$ that allow you to transform the features $X$ into values $\hat{y}$ between $0$ and $1$.
- The logistic regression formula is $\hat{y} = \sigma(Xw + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ (see the sketch after this list).
- Once we have transformed our features into $\hat{y}$, we can define a threshold (usually $0.5$) below (above) which all values in $\hat{y}$ will be treated as zeros (ones).
- With this procedure, we can assess the performance of our logistic regression model against the ground-truth data $y$.
- Question: how many parameters does logistic regression involve? How about linear regression? Why?
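Here is a minimal NumPy sketch of the prediction and thresholding procedure described in the list above. The function names (`sigmoid`, `predict`) and the toy data are illustrative assumptions, not part of the course materials.

```python
import numpy as np

def sigmoid(z):
    """The logistic function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Compute y_hat = sigmoid(Xw + b) and threshold it into 0/1 labels.

    X: (N, D) feature matrix; w: (D,) weight vector; b: scalar bias.
    Returns the probabilities y_hat and the hard labels
    (0 = Violin, 1 = Bass Tuba).
    """
    y_hat = sigmoid(X @ w + b)                 # values between 0 and 1
    labels = (y_hat >= threshold).astype(int)  # above threshold -> 1
    return y_hat, labels

# Toy usage: N = 4 datapoints with D = 3 hypothetical features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
w = rng.normal(size=3)
b = 0.0
y_hat, labels = predict(X, w, b)
```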
Binary cross-entropy
- Last class, when we optimized linear regression, we used the mean squared error function $L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$.
- For logistic regression we must use the binary cross-entropy loss, which is defined by $L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$ (the origins of this function come from statistics; if you are curious, you should take or review the materials for an introductory machine learning class, like Stanford's CS229).
- Inspecting the binary cross-entropy loss, you can see that when $y_i = 1$, the loss for that datapoint reduces to $-\log(\hat{y}_i)$. In contrast, when $y_i = 0$, it reduces to $-\log(1 - \hat{y}_i)$.
- When minimizing the binary cross-entropy loss using an algorithm like gradient descent, what we are effectively doing is making $\hat{y}$ and $y$ as similar to each other as possible (see the sketch after this list).
- Question: why does the binary cross-entropy loss have a negative sign at the beginning?
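As a sketch of how this loss is computed and minimized, the NumPy code below implements binary cross-entropy and one gradient-descent update. The gradients $\partial L / \partial w = \frac{1}{N} X^\top (\hat{y} - y)$ and $\partial L / \partial b = \frac{1}{N} \sum_i (\hat{y}_i - y_i)$ follow from the standard derivation; the function names and learning rate are illustrative.

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy, averaged over the N datapoints.

    eps keeps log() finite when y_hat saturates at exactly 0 or 1.
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient_descent_step(X, y, w, b, lr=0.1):
    """One gradient-descent update of w and b under the BCE loss."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    error = y_hat - y                     # (N,) prediction residuals
    w = w - lr * (X.T @ error) / len(y)   # dL/dw = X^T (y_hat - y) / N
    b = b - lr * np.mean(error)           # dL/db = mean(y_hat - y)
    return w, b
```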
Error metrics for binary classification
- When we are done optimizing our logistic regression model, we must evaluate it using our validation data splits (also, remember the evaluation data?).
- It’s very common practice to calculate a confusion matrix, which tells us the number of:
- true positives
- false negatives
- false positives
- true negatives
- Once we have the confusion matrix, it is also easy to calculate the:
- overall model accuracy
- true positive rate
- true negative rate
- false positive rate (type-I error)
- false negative rate (type-II error)
- Error metrics are essential for interpreting how our model performs on the different data splits of cross-validation (see the sketch below).
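A minimal sketch of the confusion matrix and the rates above, assuming 0/1 NumPy label vectors where 1 is the positive class; the function names are illustrative.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for 0/1 label vectors, with 1 as positive."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fn, fp, tn

def error_metrics(tp, fn, fp, tn):
    """Derive the five rates listed above from the confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "true_positive_rate": tp / (tp + fn),    # sensitivity / recall
        "true_negative_rate": tn / (tn + fp),    # specificity
        "false_positive_rate": fp / (fp + tn),   # type-I error rate
        "false_negative_rate": fn / (fn + tp),   # type-II error rate
    }
```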
© Iran R. Roman & Camille Noufi 2022