160 Questions for Data Science - Part 6

Intro (6/?)

I totally forgot about writing some new stuff here, but I recently found these 160 data science interview questions on Hackernoon and decided to try to answer each one of them, in order to force myself to study all of those interesting topics. I will post my answers (hopefully correct and comprehensible), aiming for ~23 answers every couple of days.

If you spot anything wrong, contact me please!

Can you explain how cross-validation works

Cross-validation is a machine learning technique used to avoid overfitting and to strengthen a model's predictions. It works by randomly splitting the training data given to a model into an actual train set and a validation (test) set. Once the model has been trained on the first one, it can be evaluated against the validation set to test its predictions.

This works because training on a single set may cause the model to overfit: it becomes highly capable of predicting values from the train set, but unable to generalize to values outside of it. When tested against a set different from the one used to train it, the model must still be able to predict unseen values in order to be accepted.
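The random split described above can be sketched in plain Python. This is a minimal illustration (the function name and the 80/20 split are my own choices, not part of any specific library):

```python
import random

def train_validation_split(data, validation_fraction=0.2, seed=0):
    """Randomly split a dataset into a train set and a validation set."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_val = int(len(data) * validation_fraction)
    val_idx = set(indices[:n_val])
    # Validation set: the randomly chosen hold-out samples;
    # train set: everything else.
    validation = [data[i] for i in indices[:n_val]]
    train = [data[i] for i in indices if i not in val_idx]
    return train, validation

data = list(range(10))
train, validation = train_validation_split(data)
print(len(train), len(validation))  # 8 2
```

In practice libraries such as scikit-learn provide ready-made helpers for this, but the idea is exactly the one shown here: the model never sees the validation samples during training.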

What is K-fold cross-validation

With K-fold cross-validation the train set is split into K smaller sets (folds). The training is then run K times: at each iteration one fold is held out as the test set and the remaining K-1 folds are used for training, so that every fold serves as the test set exactly once.

How do we choose K in K-fold cross-validation

First, K should ideally be a value that divides the sample set into folds of equal size. Higher K values slow down the computation, since the model must be trained K times, each time on a larger train set. Usually K=10 is used.
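A quick way to check which K values split a dataset into equal folds is to look at the divisors of the sample count (the dataset size of 100 below is just a hypothetical example):

```python
n = 100  # hypothetical number of samples in the dataset
# K values between 2 and 20 that split n samples into equal folds
candidates = [k for k in range(2, 21) if n % k == 0]
print(candidates)  # [2, 4, 5, 10, 20]
```

Among these, the usual K=10 is available, and it keeps each training run on 90% of the data.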