160 Questions for Data Science - Part 10

Intro (10/?)

I totally forgot about writing some new stuff here, but I recently found these 160 data science interview questions on Hackernoon and decided to try to answer each one of them, to force myself to study all of those interesting topics. I will post my answers (hopefully correct and comprehensible), aiming for ~23 answers every couple of days.

If you spot anything wrong, please contact me!

Precision-recall trade-off

The precision of a binary classification model is the ratio between True Positives and the sum of True Positives and False Positives: TP / (TP + FP). If a model is biased towards labeling samples as positive even when they are not, this value will decrease.

Recall is the ratio between True Positives and the sum of True Positives and False Negatives: TP / (TP + FN). If the model is biased towards missing positive samples and labeling them as negative, this value will decrease.
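
As a quick sanity check, here is a minimal sketch of the two formulas above, using made-up counts (the numbers are just an example, not from any real model):

```python
# Hypothetical counts from a confusion matrix (made up for illustration).
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)  # fraction of predicted positives that are truly positive
recall = tp / (tp + fn)     # fraction of actual positives the model recovers

print(f"precision = {precision:.2f}")  # 0.89
print(f"recall    = {recall:.2f}")     # 0.80
```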

In an ideal scenario, the samples are totally separable and both metrics achieve the maximum value of 1. In the real world, though, one often needs to shift the model's threshold to increase one of the two values at the cost of decreasing the other. This choice depends on the specific problem: in a medical setting, for example, false positives are usually less risky than false negatives, so the threshold is shifted to favor recall.
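
To see the trade-off concretely, here is a small sketch that sweeps the decision threshold using scikit-learn's precision_recall_curve; the labels and scores below are toy data I made up:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# Raising the threshold tends to raise precision and lower recall, and vice versa.
```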

What is the ROC curve, when to use it

The ROC curve is the curve obtained by plotting the True Positive Rate (the same value as recall) against the False Positive Rate (the ratio between False Positives and the sum of False Positives and True Negatives: FP / (FP + TN)) at various thresholds. The curve always goes from [0, 0] to [1, 1]: when the threshold is above every predicted score, all samples are classified as negative and both rates are 0; when it is below every score, all samples are classified as positive and both rates are 1.

The curve can be seen as a combination of the two rates: if each rate is plotted separately against the threshold, the way the two curves overlap is reflected in the shape of the ROC curve.

For example, a totally random model will have the two rate curves completely superimposed, and its ROC curve will be the diagonal line from [0, 0] to [1, 1].

The shape of the ROC curve helps us identify how the model behaves as the threshold changes.
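
Here is a minimal sketch of computing and plotting the ROC curve with scikit-learn, on the same kind of made-up toy data as before:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier (diagonal)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
# The curve starts at (0, 0) (threshold above every score) and ends at (1, 1)
# (threshold below every score), as described above.
```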

What is AUC (AU ROC), when to use it

The AUC is the Area Under the ROC Curve. Its value can range from 0 to 1, but a model with an AUC lower than 0.5 behaves worse than a random classifier, and is therefore not of much use.

It can be used as a single, threshold-independent metric for the overall quality of a model: the higher the AUC, the more stable the model is as the threshold changes, which means the two classes are better separated.

One way of interpreting this value is as the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample.
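
A quick sketch to check this interpretation on the same toy data: roc_auc_score should match the fraction of (positive, negative) pairs in which the positive sample receives the higher score:

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.7, 0.8, 0.9])

auc = roc_auc_score(y_true, y_score)

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
# Count ties as half a "win", matching how AUC handles equal scores.
pairs = [(p > n) + 0.5 * (p == n) for p, n in product(pos, neg)]
print(auc, np.mean(pairs))  # the two values coincide
```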