160 Questions for Data Science - Part 2

Intro (2/?)

I had totally forgotten about writing new stuff here, but I recently found these 160 data science interview questions on Hackernoon and decided to try to answer each one of them, to force myself to study all of those interesting topics. I will post my answers (hopefully correct and comprehensible), aiming for ~23 answers every couple of days.

If you spot anything wrong, please contact me!

What’s the normal distribution? Why do we care about it?

The normal distribution is a continuous probability distribution characterized by a mean and a standard deviation. When plotted, its density is shaped like a symmetric bell centered on the mean (the value with the highest probability density), with a spread dictated by the standard deviation. The density keeps decreasing as we move away from the mean in either direction, but it never reaches 0.
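As a quick illustration of the properties above, here is a minimal sketch using `statistics.NormalDist` from the Python standard library. The mean of 100 and standard deviation of 15 are hypothetical values picked only for the example.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration.
dist = NormalDist(mu=100.0, sigma=15.0)

# The density peaks at the mean and is symmetric around it.
peak = dist.pdf(100.0)
left = dist.pdf(85.0)    # one standard deviation below the mean
right = dist.pdf(115.0)  # one standard deviation above the mean


def mass_within(k):
    """Probability of falling within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)


within = {k: mass_within(k) for k in (1, 2, 3)}

# Six standard deviations away the density is tiny, but still positive.
far_tail = dist.pdf(100.0 + 6 * 15.0)

print("peak is the maximum:", peak > left)
print("symmetric:", abs(left - right) < 1e-12)
for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
print("far tail density > 0:", far_tail > 0)
```

The `within` values are the familiar rule of thumb of roughly 68%, 95% and 99.7% of the probability mass lying within 1, 2 and 3 standard deviations of the mean.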

This kind of distribution is really common in real-life data, largely as a consequence of the central limit theorem. The theorem states that the distribution of the mean of independent samples drawn from a distribution (whose shape can be unknown, as long as its variance is finite) approaches a normal as the sample size grows. A common rule of thumb is a sample size above 30, although heavily skewed distributions may need more.
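The central limit theorem is easy to check with a small simulation. The sketch below is only an illustration under assumptions of my own: the source distribution is an exponential with rate 1 (strongly skewed, with mean 1 and standard deviation 1), and the sample size and number of repetitions are arbitrary.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30    # observations averaged in each sample
NUM_SAMPLES = 5000  # how many sample means we collect


def draw_sample_mean(n):
    """Mean of n draws from an exponential distribution with rate 1."""
    return mean(random.expovariate(1.0) for _ in range(n))


sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

# The theorem predicts the means cluster around the population mean (1.0)
# with a standard deviation of sigma / sqrt(n) = 1 / sqrt(30).
mean_of_means = mean(sample_means)
spread_of_means = stdev(sample_means)
expected_spread = 1.0 / SAMPLE_SIZE ** 0.5

# For a normal distribution about 68% of values lie within one
# standard deviation of the center.
share_within_one = sum(
    abs(m - 1.0) <= expected_spread for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means:   {mean_of_means:.3f}")
print(f"spread of sample means: {spread_of_means:.3f}")
print(f"predicted spread:       {expected_spread:.3f}")
print(f"share within one sd:    {share_within_one:.3f}")
```

Even though single exponential draws look nothing like a bell, the sample means should behave close to a normal centered on 1.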

How do we check if a variable follows the normal distribution?

There are several tests that can check if a random variable can be explained by a normal distribution. Some of them are graphical, such as a histogram overlaid with a normal curve or a Q-Q plot, and check visually whether the distribution of the samples follows a normal curve.
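One graphical check is to compare the sorted data against the quantiles a fitted normal would produce; on a plot, normal data falls close to a straight line. The sketch below skips the plot and summarizes the straightness as a correlation coefficient. The helper `qq_correlation` and the two synthetic datasets are my own illustration, not a standard library function.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_correlation(values):
    """Correlation between sorted data and the matching normal quantiles."""
    n = len(values)
    ordered = sorted(values)
    fitted = NormalDist(mean(values), stdev(values))
    # Plotting positions (i + 0.5) / n stay strictly between 0 and 1.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = mean(theoretical), mean(ordered)
    cov = sum((x - mx) * (y - my) for x, y in zip(theoretical, ordered))
    var_x = sum((x - mx) ** 2 for x in theoretical)
    var_y = sum((y - my) ** 2 for y in ordered)
    return cov / (var_x * var_y) ** 0.5


random.seed(1)
normal_data = [random.gauss(50.0, 5.0) for _ in range(500)]
skewed_data = [random.expovariate(1.0) for _ in range(500)]

r_normal = qq_correlation(normal_data)
r_skewed = qq_correlation(skewed_data)

print(f"Q-Q correlation, normal data: {r_normal:.4f}")
print(f"Q-Q correlation, skewed data: {r_skewed:.4f}")
```

A value very close to 1 means the points hug the straight line; the skewed sample should score visibly lower.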

It’s also possible to check whether the most extreme observed values are plausible under the hypothesized normal distribution. Other tests measure the skewness (the asymmetry of the curve) and the kurtosis (how heavy the tails are) of the observed distribution, compare them against those of a normal distribution, and return a p-value that can be compared against a chosen significance level.
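As one concrete example of a skewness and kurtosis based test, here is a minimal sketch of the Jarque-Bera test written with the standard library only. I picked this test for illustration because its p-value has a simple closed form; it is an asymptotic test, so it is only reliable on reasonably large samples, and the synthetic datasets are assumptions of mine.

```python
import math
import random


def jarque_bera(values):
    """Return (skewness, excess kurtosis, JB statistic, p-value)."""
    n = len(values)
    m = sum(values) / n
    # Central moments in population form (dividing by n).
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    statistic = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under normality the statistic is asymptotically chi-squared with
    # 2 degrees of freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return skewness, excess_kurtosis, statistic, p_value


random.seed(3)
normal_data = [random.gauss(0.0, 1.0) for _ in range(1000)]
skewed_data = [random.expovariate(1.0) for _ in range(1000)]

for name, data in (("normal", normal_data), ("skewed", skewed_data)):
    skew, kurt, stat, p = jarque_bera(data)
    print(f"{name}: skew={skew:.3f} kurt={kurt:.3f} JB={stat:.2f} p={p:.4g}")
```

A small p-value (for example below 0.05) is evidence against normality; a large one only means the test found no evidence against it.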

What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?

Prices cannot go lower than 0, so strictly speaking they cannot be normally distributed. That said, they are often modeled with a lognormal distribution: while the price itself is not normal, the log of the price is assumed to be. The result is a right-skewed curve over the domain of price, with a long tail of high prices and no values lower than 0.
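To make the lognormal idea concrete, here is a small sketch that generates synthetic "prices" and looks at them before and after taking the log. The parameters `MU` and `SIGMA` are hypothetical values for the log-price, not estimates from any real dataset.

```python
import math
import random
from statistics import mean, median

random.seed(2)

# Hypothetical mean and standard deviation of the log-price.
MU, SIGMA = 3.0, 0.8

# lognormvariate draws exp(X) where X is normal with mean MU and std SIGMA.
prices = [random.lognormvariate(MU, SIGMA) for _ in range(10_000)]
log_prices = [math.log(p) for p in prices]

# Raw prices: always positive and right-skewed, so the long tail of
# expensive items pulls the mean above the median.
print(f"min price:    {min(prices):.2f}")
print(f"mean price:   {mean(prices):.2f}")
print(f"median price: {median(prices):.2f}")

# Log prices: roughly symmetric, so mean and median are both close to MU.
print(f"mean log-price:   {mean(log_prices):.3f}")
print(f"median log-price: {median(log_prices):.3f}")
```

The gap between mean and median on the raw prices, and its disappearance after the log, is the skew described above.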

As for preprocessing, I don’t know of any step that always has to be applied to prices. Given the lognormal shape, though, taking the log of the price before modeling is a natural candidate.