Machine Learning Strategy Notes

Notes on the Coursera course Structuring Machine Learning Projects, which I recently started.

Intro

Orthogonalisation - varying one hyperparameter affects exactly one metric.

Examples of orthogonal hyperparameters:

Metric                   Hyperparameter
Fit on training set      network size, optimisation algorithm
Fit on validation set    regularisation, bigger training set
Fit on test set          bigger validation set

Early stopping is not very orthogonal because it affects both training and validation fit.

Defining a goal

Evaluation metric - It is good practice to have a single-number evaluation metric because it makes it easier to compare different models. This might mean combining several evaluation metrics using an average, a harmonic mean, or another approach.
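
A minimal sketch of combining two metrics into one number, using precision and recall with the harmonic mean (F1) purely as an example; the metric names and values are illustrative, not from the course:

```python
# Combine precision and recall into a single number via the harmonic mean (F1).
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two models that are hard to rank with two separate numbers
model_a = {"precision": 0.95, "recall": 0.80}
model_b = {"precision": 0.88, "recall": 0.90}

for name, m in [("A", model_a), ("B", model_b)]:
    print(name, round(f1_score(m["precision"], m["recall"]), 3))
# Model B wins on the single-number metric (0.890 vs 0.869).
```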

Satisficing evaluation metric - an evaluation metric which only needs to fall within a certain interval (e.g. below a threshold). As long as the metric is within the specified interval there is no value in optimising it further.

Optimising evaluation metric - an evaluation metric which should be as small/large as possible. One should always aim to optimise this metric.

Combining satisficing and optimising metrics - maximise the optimising metric as long as the satisficing metrics are within bounds.
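
A small illustration of the combined rule, assuming accuracy as the optimising metric and runtime as the satisficing one; all names, numbers and the 100 ms bound are made up:

```python
# Pick the model with the best optimising metric (accuracy) among those
# that satisfy the satisficing constraint (runtime below a bound).
models = [
    {"name": "small",  "accuracy": 0.90, "runtime_ms": 30},
    {"name": "medium", "accuracy": 0.94, "runtime_ms": 80},
    {"name": "large",  "accuracy": 0.95, "runtime_ms": 250},  # too slow
]

MAX_RUNTIME_MS = 100  # satisficing bound

feasible = [m for m in models if m["runtime_ms"] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda m: m["accuracy"])
print(best["name"])  # -> "medium"
```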

Test and Validation sets

The test set and validation set should come from the same distribution, and they should be representative of the data the model is expected to see in the real world. Their sizes should be big enough to give a reliable estimate of performance on unseen data; making them bigger than that adds no value.

Comparing to human level performance

Bayes optimal error - theoretical limit on how well an ML algorithm can perform on a particular task; it cannot be beaten without overfitting. Human level performance can be an approximation of the Bayes optimal error when humans are good at the particular task.

Avoidable bias - difference between the Bayes optimal error and the training set error. Intuitive explanation - if the network has not been trained for long enough it has a bias, which can be avoided by training for longer.

Variance - difference between training and validation set errors.

Example: with a Bayes optimal error of 1%, a train error of 5% and a validation error of 10%, the avoidable bias is 4% and the variance is 5%.
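
A tiny sketch of the same decomposition in code, using the example numbers above:

```python
bayes_error = 0.01   # approximated, e.g. by human level performance
train_error = 0.05
val_error = 0.10

avoidable_bias = train_error - bayes_error   # 0.04
variance = val_error - train_error           # 0.05

# Whichever gap is larger suggests where to focus next
focus = "reduce bias" if avoidable_bias > variance else "reduce variance"
print(focus)  # -> "reduce variance"
```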

Depending on which one is bigger, tactics for reducing bias or variance should be used. Some strategies for reducing bias or variance are:

Bias reduction                     Variance reduction
Bigger network size                Use more data
Different optimisation algorithm   Regularisation
Change architecture: RNN, CNN      Change architecture: RNN, CNN
Hyperparameter search              Hyperparameter search

Error analysis

Error analysis - manually go through a sample of the errors, or all of them, and classify them into categories. Error analysis results can be summarised in a table such as:

Sample #     Category 1   Category 2
1            1            0
2            1            0
n            0            1
Total: 100   70           30

The table identifies, for each sample, which category it belongs to, and the bottom row sums up all the errors in each category. The share of errors in each category indicates how much accuracy can be gained / how much the error can be reduced by addressing that particular problem:

$$\text{Max \% error reduction} = \dfrac{\text{Number of errors in Category 1}}{\text{Validation set size}}$$
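
A hedged sketch of such a tally in Python; the category names ("blurry", "mislabeled"), the counts and the validation set size are invented for illustration:

```python
# Tally error-analysis categories and compute the maximum possible
# error reduction (in absolute percentage points) per category.
from collections import Counter

VAL_SET_SIZE = 1000
errors = (
    [{"blurry": 1, "mislabeled": 0}] * 70 +   # 70 blurry images
    [{"blurry": 0, "mislabeled": 1}] * 30     # 30 label mistakes
)

totals = Counter()
for e in errors:
    totals.update(e)

for category, count in totals.items():
    share_of_errors = count / len(errors)
    max_error_reduction = count / VAL_SET_SIZE
    print(f"{category}: {share_of_errors:.0%} of errors, "
          f"up to {max_error_reduction:.1%} absolute error reduction")
```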

Incorrectly labeled data (by the labeler, not the NN)

Incorrectly labeled examples in the training set can in most cases be ignored due to the robustness of deep learning. Incorrectly labeled examples in the validation set can be treated like any other category in error analysis: if they make up a significant fraction of the total errors, they are worth fixing. When fixing incorrectly labeled examples in the validation set, make sure to also re-check the examples the NN got right, not only the ones it got wrong - some of the "correct" predictions only agreed with a wrong label and should actually count as errors.

Initial system

Build the first system quickly ("quick and dirty") and then iterate, improving different aspects.

Mismatched train and validation/test sets

Mismatched train and validation/test sets - the train set distribution is different from the validation/test set distribution. A possible cause: when there is not enough data for the specific problem, data for a more general version of the problem can be used for training, while the data for the specific problem is kept as the validation and test sets.

Bias and variance cannot be estimated directly from the gaps between Bayes optimal error and train error, and between train error and validation error, because the train and validation sets come from different distributions: it is not clear what part of the difference between validation and train error is due to variance and what part is due to data mismatch. A solution is to introduce a train-validation set with the same distribution as the train set. The difference between train-validation error and train error indicates variance, and the difference between validation error and train-validation error indicates data mismatch.

Example: with a Bayes error of 1%, a train error of 5%, a train-validation error of 10% and a validation error of 12%, the avoidable bias is 4%, the variance is 5% and the data mismatch is 2%.
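
Operationally, the train-validation set is just a held-out slice of the training distribution. A minimal sketch, where `evaluate`, `X_val` and `y_val` are hypothetical placeholders and the data is random for illustration:

```python
# Carve a train-validation ("train-dev") set out of the training distribution
# so that variance and data mismatch can be separated.
import numpy as np

rng = np.random.default_rng(0)

X_train_full = rng.normal(size=(100_000, 20))        # data from the "general" distribution
y_train_full = rng.integers(0, 2, size=100_000)

perm = rng.permutation(len(X_train_full))
n_train_val = 10_000

train_val_idx, train_idx = perm[:n_train_val], perm[n_train_val:]
X_train, y_train = X_train_full[train_idx], y_train_full[train_idx]
X_train_val, y_train_val = X_train_full[train_val_idx], y_train_full[train_val_idx]

# train_error     = evaluate(model, X_train, y_train)
# train_val_error = evaluate(model, X_train_val, y_train_val)  # gap to train error -> variance
# val_error       = evaluate(model, X_val, y_val)              # gap to train-val error -> data mismatch
```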

To address data mismatch, perform error analysis to find out on which categories of examples the two sets differ, and add more examples of those categories to the training data.

Learning from multiple tasks

Transfer learning - train a NN to do well on a particular problem, then remove the last layer, add a new one with random weights and retrain on another type of problem. Retraining can be done for only the last layer or for all layers, and one or several new layers can be added (a sketch follows the list below). Transfer learning makes sense when:

  • Problem A and problem B have the same type of input
  • There is a lot more training data for problem A
  • Low level features from problem A can be useful for problem B
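
A minimal PyTorch sketch of the idea, assuming a toy feed-forward network; the layer sizes, class counts and training details are placeholders, not anything from the course:

```python
# Transfer learning: reuse a network trained on problem A, replace its last
# layer with a freshly initialised one, and retrain for problem B.
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, out_dim: int):
        super().__init__()
        # Shared "body" whose low level features may transfer between problems
        self.body = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.head = nn.Linear(128, out_dim)

    def forward(self, x):
        return self.head(self.body(x))

model = Net(out_dim=10)          # assume this has been trained on problem A
model.head = nn.Linear(128, 3)   # new, randomly initialised layer for problem B

# Option 1: retrain only the new layer (freeze the shared body)
for p in model.body.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)

# Option 2: fine-tune everything - leave all parameters trainable and pass
# model.parameters() to the optimiser instead.
```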

Multi task learning - train an NN to perform multiple tasks at the same time. The output layer has n neurons, one for each of the n tasks, and the loss function is the sum of the individual losses for each output target (a sketch follows the list below). Multi task learning makes sense when:

  • Tasks have shared lower level features
  • Similar amount of data for each task
  • Can train big enough network
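
A minimal PyTorch sketch of a multi-task setup with one output neuron per task and a loss that sums the per-task losses; the sizes, task count and random tensors are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_tasks = 4
# Shared network with one output logit per task
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, n_tasks),
)

x = torch.randn(32, 64)                          # batch of 32 examples
y = torch.randint(0, 2, (32, n_tasks)).float()   # one binary label per task

logits = model(x)
# Loss = sum of per-task binary cross-entropy losses, averaged over the batch
per_element = F.binary_cross_entropy_with_logits(logits, y, reduction="none")
loss = per_element.sum(dim=1).mean()
loss.backward()
```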

Note: Given a big enough network, multi task learning performs at least as well as training separate neural networks for each task.

End to end deep learning

End to end deep learning - perform all subtasks of main task using one neural network.

Pros

  • Features extracted from raw data - no human induced bias
  • Less hand designing of components needed

Cons

  • Need a large amount of data
  • Excludes potentially useful hand designed components

Conclusions and applications in financial forecasting with deep learning

Most of the concepts from the course apply directly to classification problems. However, it is worth thinking about how to extend these concepts to regression and macroeconomic forecasting.

Some things that might be useful for my project of financial forecasting using deep learning are:

  • Single number evaluation metric. Using a single number evaluation metric is probably a good idea, as it is often hard to compare models using multiple metrics.
  • Ensuring that the validation and test sets come from the same distribution. This could prove to be harder in financial forecasting models because the order of the data matters. In any case it is a good idea to verify how much the two distributions differ and figure out what the effects of this are on the model.
  • Figure out whether to reduce bias or variance. In the task of financial forecasting human level performance cannot be used as a proxy for the Bayes optimal error, making it harder to figure out whether bias (fit on the training set) or variance (fit on the validation set) needs to be improved. Nevertheless, it seems like a good idea to try to determine some value for the Bayes error and use this technique.
  • Error analysis. The problem with applying error analysis to regression tasks is that it is hard to determine what counts as an error, because errors are continuous rather than discrete as in classification. A possible solution would be to set thresholds on the errors and use them to determine whether an output is correctly "classified" (see the sketch after this list). Possible categories of errors could be determined by the variance or trend at the particular point in time.
  • Using data for a more general version of the problem for training. Since there is not a lot of data in macroeconomic forecasting, this seems like a good approach to explore. A possible similar "more general problem" for which more data is available could be microeconomic forecasting. The effects of data mismatch should be taken into account, as well as the actual usefulness of the general data for the specific problem of macroeconomic forecasting.
  • Transfer learning seems more appropriate than the above point. Under the assumption that there are common low level features between micro- and macroeconomic forecasting, an NN can be trained on micro data, which is more plentiful, and the estimated network can then be used for macroeconomic forecasting.
  • Multi task learning also looks very promising because multiple variables can be predicted at the same time. The success of this approach is subject to the assumption that there are common low level features between different macroeconomic variables. A financial crisis often affects multiple macro variables at the same time, which might mean that multi task learning is worth a try.
  • End to end deep learning requires a lot of data and is probably not a good approach to try with macroeconomic forecasting. If anything it probably indicates that it would be a good idea to perform custom preprocessing (trend and seasonality removal/differencing) for each individual variable before fitting a model.
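
A hedged sketch of the thresholding idea mentioned in the error-analysis bullet above; the threshold, the volatility-based bucketing and all names and numbers are assumptions for illustration:

```python
# Treat a forecast as an "error" when its absolute residual exceeds a
# threshold, then bucket those errors by a property of the series at that
# point (here: recent volatility).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=500)                       # placeholder target series
y_pred = y_true + rng.normal(scale=0.3, size=500)   # placeholder forecasts

residuals = np.abs(y_true - y_pred)
threshold = 0.5                                     # what counts as an "error"
error_idx = np.where(residuals > threshold)[0]

# Bucket errors by local volatility of the target series
window = 20
vol = np.array([y_true[max(0, i - window):i + 1].std() for i in range(len(y_true))])
high_vol_errors = (vol[error_idx] > np.median(vol)).sum()

print(f"{len(error_idx)} errors, {high_vol_errors} of them in high-volatility periods")
```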

As a whole this course illustrated a lot of general, high level engineering techniques for avoiding mistakes when developing a deep learning algorithm. Even though most are directly applicable to classification problems, it looks promising to try to think about them from a regression-problem perspective.
