Cross-validation and linear regression
Cross-validation
- No matter how big a dataset is, at the end of the day it has a limited number of datapoints.
- Data is very valuable, and to develop robust machine learning methods, we must use it carefully. WARNING: the improper use of data is guaranteed to be a waste of your time.
- Cross-validation is a simple and robust method for using data carefully to systematically analyze how well a model can perform.
- To carry out cross-validation, we must randomly split our data into two sets:
- A development set with ~95% of the data
- And a testing set, with ~5% of the data
- We will save the testing set to assess our model (or models) at the very final testing phase of the development cycle.
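The split above can be sketched with numpy. This is a minimal illustration, assuming a hypothetical dataset of 1000 datapoints with 3 features each; the names and sizes are placeholders, not part of the course material:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is reproducible

# hypothetical dataset: 1000 datapoints, 3 features each
X = rng.normal(size=(1000, 3))
y = rng.normal(size=1000)

# shuffle the indices, then carve off ~5% for the held-out testing set
idx = rng.permutation(len(X))
n_test = int(0.05 * len(X))
test_idx, dev_idx = idx[:n_test], idx[n_test:]

X_dev, y_dev = X[dev_idx], y[dev_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_dev.shape, X_test.shape)  # (950, 3) (50, 3)
```

The testing set is set aside here and only touched again at the very final testing phase.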
-
- Cross-validation further splits the development data randomly into C folds (usually five) of equal size.
- Then, we will train our model (from scratch) using C-1 of the development folds, and we will validate (i.e. assess model performance) using the remaining fold.
- We will repeat this procedure until we have used each of the C folds as the validation fold, and we will calculate the average cross-validation accuracy across folds.
- Note: some datasets have already been split into testing and development sets, and the development set may itself already be split into fixed training and validation subsets. In this situation it is recommended to follow the suggested training and validation data splitting instead of creating training and validation folds.
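The fold procedure above can be sketched as follows. This is a minimal sketch with numpy, assuming C = 5 and a hypothetical development set generated from a known linear rule; the model trained "from scratch" on each round is a least-squares line fit, standing in for whatever model you are developing:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# hypothetical development data (the test set has already been held out)
X_dev = rng.normal(size=(950, 3))
y_dev = X_dev @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=950)

C = 5  # number of folds
folds = np.array_split(rng.permutation(len(X_dev)), C)

val_mse = []
for k in range(C):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(C) if j != k])

    # train from scratch on the C-1 training folds (least-squares line fit)
    A = np.c_[X_dev[train_idx], np.ones(len(train_idx))]  # append intercept column
    coef, *_ = np.linalg.lstsq(A, y_dev[train_idx], rcond=None)

    # validate on the single remaining fold
    A_val = np.c_[X_dev[val_idx], np.ones(len(val_idx))]
    pred = A_val @ coef
    val_mse.append(np.mean((pred - y_dev[val_idx]) ** 2))

print(np.mean(val_mse))  # average cross-validation error across the C folds
```

Each fold serves as the validation fold exactly once, and the final number reported is the average across all C rounds.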
- Why is all of this necessary?
Linear regression
- Given a datapoint x (i.e. a vector with D features) and a dependent observation y, can we use the equation of a line to describe the relationship between x and y?
- Picture this scenario:
- You have a dataset consisting of musical tones, each with a fundamental frequency (f0).
- You calculate the average zero-crossing rate per second for each of these tones.
- Assuming there is a linear relationship between zero-crossing rate and f0, if the zero-crossing rate is x and f0 is y, you could use a line and an error term ε to model the data using the equation y = mx + b + ε.
- To assess this model, we must assume there will be an error ε that we will have to "tolerate", and find the variables m and b using the objective function J(m, b) = (1/2N) Σᵢ (ŷᵢ - yᵢ)², where N is the number of datapoints we are using to calculate the objective, and ŷᵢ = mxᵢ + b.
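The objective J can be computed directly. A minimal sketch, assuming the model ŷ = mx + b and half-MSE objective described above; the zero-crossing rates and f0 values are made-up numbers for illustration:

```python
import numpy as np

def objective(m, b, x, y):
    """Half the mean squared error of the line y_hat = m*x + b."""
    y_hat = m * x + b
    return np.mean((y_hat - y) ** 2) / 2

# hypothetical tones: zero-crossing rates x and fundamental frequencies y
x = np.array([100.0, 200.0, 300.0, 400.0])
y = np.array([52.0, 101.0, 153.0, 198.0])

print(objective(0.5, 1.0, x, y))  # 1.75
```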
- The objective function J is effectively half the mean squared error between our model prediction ŷᵢ and the ground-truth value yᵢ.
- We can use a simple method like gradient descent (review if necessary) to find the optimal m and b so that J(m, b) is as low as possible.
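A minimal gradient-descent sketch for this objective. Since J = (1/2N) Σᵢ (ŷᵢ - yᵢ)², its gradients are ∂J/∂m = (1/N) Σᵢ (ŷᵢ - yᵢ)xᵢ and ∂J/∂b = (1/N) Σᵢ (ŷᵢ - yᵢ). The data here are synthetic (drawn from a known line y = 2x + 1 plus noise), and the learning rate is an assumed value you would tune for real data:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# synthetic data from a known line y = 2x + 1, plus a little noise
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=200)

m, b = 0.0, 0.0  # start from an arbitrary guess
lr = 0.5         # learning rate (assumed; tune for your data)
for _ in range(2000):
    y_hat = m * x + b
    err = y_hat - y
    # gradients of J = (1/2N) * sum(err^2) with respect to m and b
    grad_m = np.mean(err * x)
    grad_b = np.mean(err)
    m -= lr * grad_m
    b -= lr * grad_b

print(m, b)  # should land close to the true slope 2.0 and intercept 1.0
```

Because the data were generated from a known line, recovering m ≈ 2 and b ≈ 1 confirms the descent is working.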
© Iran R. Roman & Camille Noufi 2022