160 Questions for Data Science - Part 5

Intro (5/?)

I had totally forgotten about writing new posts here, but recently I found these 160 data science interview questions on Hackernoon and decided to answer each one of them, to force myself to study all of these interesting topics. I will post my answers (hopefully right and comprehensible), aiming for ~23 answers every couple of days.

If you spot anything wrong, please contact me!

What is overfitting

Whenever one is trying to create a model, there are two extremes that must be avoided. The first, and simpler, one is building a model that is totally unfit to the data: such a model cannot represent the dataset, nor predict any new value.

On the other side, one must not fit the model perfectly to every point in the dataset either. That is what leads to overfitting: the situation in which a model fits the dataset so perfectly that it ends up representing every kind of variation, even random noise. This, of course, creates problems when predicting new data: a model that is too tightly adapted to the existing data won't generalize to new datapoints, even less so when those datapoints fall outside the current data range.
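The contrast above can be sketched with a tiny experiment. This is a minimal illustration, not anything prescribed by the question: the dataset, the random seed, and the choice of numpy.polyfit as the "model" are all my own assumptions. A degree-9 polynomial has enough coefficients to pass through all ten noisy training points, so its training error collapses while its error on fresh points from the same process stays large.

```python
import numpy as np

# Hypothetical dataset: a noisy linear trend (seed and scale are
# illustrative choices, not from the article).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=10)

# Two extremes: a constant (degree 0, unfit to the data) and a
# degree-9 polynomial (enough parameters to memorize all 10 points).
underfit = np.polyfit(x_train, y_train, 0)
overfit = np.polyfit(x_train, y_train, 9)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# New points drawn from the same underlying process.
x_new = np.linspace(0, 1, 50)
y_new = 2 * x_new + rng.normal(scale=0.2, size=50)

print(mse(overfit, x_train, y_train))  # near zero: it memorized the noise
print(mse(overfit, x_new, y_new))      # noticeably larger on unseen points
```

The overfit model "wins" on the training points precisely because it chased the noise, which is exactly what hurts it on the new ones.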

How to validate your models

Model validation is done with multiple metrics and instruments: one can check the model's accuracy (e.g. with the Mean Squared Error), how much information it retains (with R squared), and the correctness of its fit, with methods like cross-validation. In cross-validation, the model is tested against a dataset that is different from the training dataset but has the same features. Usually, this validation dataset is simply a part of the original dataset that was not used for training.
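A common variant of this idea is k-fold cross-validation, which can be sketched in a few lines. Everything here is an illustrative assumption on my part (the `k_fold_mse` helper, the use of numpy.polyfit, the synthetic data): each fold is held out once as the validation set while the model trains on the remaining folds, and the held-out errors are averaged.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5):
    """Average held-out MSE of a polynomial fit over k folds.

    Minimal sketch of k-fold cross-validation: every point is used
    for validation exactly once, and for training k-1 times.
    """
    idx = np.arange(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)           # train on the rest...
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])        # ...validate on the fold
        errors.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errors))

# Hypothetical noisy linear data (seed and sizes are illustrative).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
y = 3 * x + 1 + rng.normal(scale=0.1, size=60)

print(k_fold_mse(x, y, degree=1))  # roughly the noise floor for a good fit
```

For a well-specified model the cross-validated error should land near the irreducible noise level, while a badly over-parameterized one scores visibly worse.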

Why do we need to split our data into three parts: train, validation, and test

When creating a model, it's important to test its accuracy and correctness. The dataset used to create the model is divided into three parts, with the train part usually being the biggest one. The model is trained against the train dataset, trying to minimize some error function during training. Since the model has been fit to the train dataset, the error measured on that same data cannot be used as an unbiased metric to evaluate the model once it's ready.

Then the model is run against the validation part, for which we already know the dependent variable values. We nonetheless try to predict those values: if the model is accurate, but not overfit, the predicted values will be close enough to the actual ones. This validation part is used to tune the hyperparameters of the model (for instance, in polynomial regression, the degree of the polynomial function used).

Finally, in order to evaluate the model without any bias from prior data, it is run (once) against the test dataset, and the accuracy of its predictions there can be used as a metric for the final model.
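The whole three-way workflow can be sketched end to end. The 60/20/20 split, the sine-shaped data, and the candidate degrees below are all illustrative assumptions of mine, not part of the question: we fit only on the train set, use the validation set to pick the hyperparameter (here, the polynomial degree), and touch the test set exactly once at the end.

```python
import numpy as np

# Hypothetical dataset: a noisy sine wave (seed and sizes are
# illustrative choices).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=200)

# Shuffle indices, then split 60/20/20 into train / validation / test.
idx = rng.permutation(200)
train, val, test = idx[:120], idx[120:160], idx[160:]

def mse(coeffs, part):
    """Mean squared error of a fitted polynomial on one split."""
    return float(np.mean((np.polyval(coeffs, x[part]) - y[part]) ** 2))

# Fit each candidate degree on the train split only.
candidates = {d: np.polyfit(x[train], y[train], d) for d in range(1, 10)}

# Tune the hyperparameter (degree) on the validation split...
best_degree = min(candidates, key=lambda d: mse(candidates[d], val))

# ...and evaluate the chosen model on the test split, once.
test_error = mse(candidates[best_degree], test)
print(best_degree, test_error)
```

Because a straight line cannot follow a full sine period, validation error steers the choice toward a higher degree, and the test error of that chosen model is the unbiased number to report.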