I totally forgot about writing some new stuff here, but recently I found these 160 data science interview questions on Hackernoon, and decided to try to answer each one of them to force myself to study all of those interesting topics. I will post my answers (hopefully right and comprehensible), trying to write two or three answers every couple of days.
If you spot anything wrong, contact me please!
Is accuracy always a good metric
No. Accuracy can be misleading when one class in the dataset vastly outnumbers the other. As an example, one can imagine a dataset where only 5% of the items belong to class A, while 95% belong to class B. In a dataset such as this, a model that always predicts class B will have 95% accuracy, but it’s still not a good model.
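The scenario above can be sketched in a few lines of Python (the labels "A" and "B" are just illustrative):

```python
# 5% of samples belong to class A, 95% to class B,
# and a trivial model always predicts the majority class B.
y_true = ["A"] * 5 + ["B"] * 95
y_pred = ["B"] * 100

# Accuracy = fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 — high accuracy, yet the model never finds class A
```

Despite the 95% accuracy, this "model" is useless for detecting class A, which is exactly why accuracy alone can mislead.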
What is the confusion table, what are the cells in this table
The confusion table is a representation of how a classification model misses its predictions: it lets us see which errors are most common by showing how samples are classified when the prediction is wrong.
The table is an N×N matrix (one row and one column per class) with the correct predictions on the diagonal; each cell [x, y] shows how many times the model predicted class x when the actual class was y.
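As a small sketch of this idea, a confusion table can be built by hand from a pair of label lists (the class names here are made up for illustration):

```python
from collections import Counter

y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

labels = ["cat", "dog", "bird"]
counts = Counter(zip(y_pred, y_true))

# matrix[x][y]: how many times the model predicted class x
# when the actual class was y; correct predictions lie on the diagonal.
matrix = [[counts[(p, a)] for a in labels] for p in labels]
for label, row in zip(labels, matrix):
    print(label, row)
```

In practice one would usually call something like scikit-learn's `confusion_matrix`, but the manual version makes the [x, y] indexing explicit.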
What is precision, recall, and F1-score
Precision is the ratio between the true positives of a class (the samples correctly assigned to it) and all the samples the model assigned to that class (true positives plus false positives): this value represents how ‘valuable’ the predictions of a model are for a specific class.
Recall is the ratio between the true positives of a class and all the actual members of that class (true positives plus false negatives): this value represents how capable the model is of recognizing a class.
F1-score is a composite measure of precision and recall: it is defined as twice the product of precision and recall divided by their sum (their harmonic mean). It can be seen as an accuracy-like measure that is less biased by class imbalance.
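The three definitions above can be sketched directly from their formulas, computed here for a single positive class "A" on a tiny made-up label set:

```python
y_true = ["A", "A", "A", "B", "B", "B", "B", "B"]
y_pred = ["A", "A", "B", "A", "B", "B", "B", "B"]

# Counts for the positive class "A".
tp = sum(t == "A" and p == "A" for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t != "A" and p == "A" for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == "A" and p != "A" for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)                      # TP / (TP + FP)
recall = tp / (tp + fn)                         # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

With these labels tp = 2, fp = 1, fn = 1, so precision and recall are both 2/3, and the F1-score (their harmonic mean) is 2/3 as well.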