Cross-validation and linear regression
Cross-validation
- No matter how big a dataset is, at the end of the day it has a limited number of datapoints.
- Data is very valuable, and to develop robust machine learning methods, we must use it carefully. WARNING: the improper use of data is guaranteed to be a waste of your time.
- Cross-validation is a simple and robust method for using data carefully to systematically analyze how well a model can perform.
- To carry out cross-validation, we must randomly split our data into two sets:
- A development set with ~95% of the data
- And a testing set, with ~5% of the data
- We will save the testing set to assess our model (or models) at the very final testing phase of the development cycle.
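The split above can be sketched with numpy. This is a minimal illustration, assuming a hypothetical dataset of 1000 datapoints with 3 features each; the names and sizes are placeholders, not part of the course material:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is reproducible

# hypothetical dataset: 1000 datapoints, 3 features each
X = rng.normal(size=(1000, 3))
y = rng.normal(size=1000)

# shuffle the indices, then carve off ~5% for the held-out testing set
idx = rng.permutation(len(X))
n_test = int(0.05 * len(X))
test_idx, dev_idx = idx[:n_test], idx[n_test:]

X_dev, y_dev = X[dev_idx], y[dev_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_dev.shape, X_test.shape)  # (950, 3) (50, 3)
```

The testing set is set aside here and only touched again at the very final testing phase.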
-
- Cross-validation further splits the development data randomly into C folds (usually five) of equal size.
- Then, we will train our model (from scratch) using C-1 of the development folds, and we will validate (i.e. assess model performance) using the remaining fold.
- We will repeat this procedure until we have used each of the C folds as the validation fold, and we will calculate the average cross-validation accuracy across folds.
- Note: some datasets have already been split into testing and development sets, and the development set may itself already be split into fixed training and validation subsets. In this situation it is recommended to follow the suggested training and validation data splitting instead of creating training and validation folds.
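The fold procedure above can be sketched as follows. This is a minimal sketch with numpy, assuming C = 5 and a hypothetical development set generated from a known linear rule; the model trained "from scratch" on each round is a least-squares line fit, standing in for whatever model you are developing:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# hypothetical development data (the test set has already been held out)
X_dev = rng.normal(size=(950, 3))
y_dev = X_dev @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=950)

C = 5  # number of folds
folds = np.array_split(rng.permutation(len(X_dev)), C)

val_mse = []
for k in range(C):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(C) if j != k])

    # train from scratch on the C-1 training folds (least-squares line fit)
    A = np.c_[X_dev[train_idx], np.ones(len(train_idx))]  # append intercept column
    coef, *_ = np.linalg.lstsq(A, y_dev[train_idx], rcond=None)

    # validate on the single remaining fold
    A_val = np.c_[X_dev[val_idx], np.ones(len(val_idx))]
    pred = A_val @ coef
    val_mse.append(np.mean((pred - y_dev[val_idx]) ** 2))

print(np.mean(val_mse))  # average cross-validation error across the C folds
```

Each fold serves as the validation fold exactly once, and the final number reported is the average across all C rounds.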
- Why is all of this necessary?
Linear regression
- Given a datapoint x (i.e. a vector with D features) and a dependent observation y, can we use the equation of a line to describe the relationship between x and y?
- Picture this scenario:
- You have a dataset consisting of musical tones, each with a fundamental frequency (f0).
- You calculate the average zero-crossing rate per second for each of these tones.
- Assuming there is a linear relationship between zero-crossing rate and f0, if the zero-crossing rate is x and f0 is y, you could use a line and an error term ε to model the data using the equation y = mx + b + ε.
- To assess this model, we must assume there will be an error ε that we will have to "tolerate", and find the variables m and b using the objective function J(m, b) = (1/2N) Σᵢ (ŷᵢ - yᵢ)², where N is the number of datapoints we are using to calculate the objective, and ŷᵢ = mxᵢ + b.
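The objective J can be computed directly. A minimal sketch, assuming the model ŷ = mx + b and half-MSE objective described above; the zero-crossing rates and f0 values are made-up numbers for illustration:

```python
import numpy as np

def objective(m, b, x, y):
    """Half the mean squared error of the line y_hat = m*x + b."""
    y_hat = m * x + b
    return np.mean((y_hat - y) ** 2) / 2

# hypothetical tones: zero-crossing rates x and fundamental frequencies y
x = np.array([100.0, 200.0, 300.0, 400.0])
y = np.array([52.0, 101.0, 153.0, 198.0])

print(objective(0.5, 1.0, x, y))  # 1.75
```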
- The objective function J is effectively half the mean squared error between our model prediction ŷᵢ and the ground-truth value yᵢ.
- We can use a simple method like gradient descent (review if necessary) to find the optimal m and b so that J(m, b) is as low as possible.
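A minimal gradient-descent sketch for this objective. Since J = (1/2N) Σᵢ (ŷᵢ - yᵢ)², its gradients are ∂J/∂m = (1/N) Σᵢ (ŷᵢ - yᵢ)xᵢ and ∂J/∂b = (1/N) Σᵢ (ŷᵢ - yᵢ). The data here are synthetic (drawn from a known line y = 2x + 1 plus noise), and the learning rate is an assumed value you would tune for real data:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# synthetic data from a known line y = 2x + 1, plus a little noise
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=200)

m, b = 0.0, 0.0  # start from an arbitrary guess
lr = 0.5         # learning rate (assumed; tune for your data)
for _ in range(2000):
    y_hat = m * x + b
    err = y_hat - y
    # gradients of J = (1/2N) * sum(err^2) with respect to m and b
    grad_m = np.mean(err * x)
    grad_b = np.mean(err)
    m -= lr * grad_m
    b -= lr * grad_b

print(m, b)  # should land close to the true slope 2.0 and intercept 1.0
```

Because the data were generated from a known line, recovering m ≈ 2 and b ≈ 1 confirms the descent is working.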
© Iran R. Roman & Camille Noufi 2022