Removing outliers

If you've got a sequence of data that you want to do some statistical analysis on, but you know that some of it is bad, how do you remove the bad data? You could just remove the top 5% and bottom 5% of values, for example, but maybe you don't have any bad data (or maybe you have lots), in which case a fixed cut like that would adversely affect your measurement of the standard deviation.

Suppose that you know that your dataset is supposed to follow a normal distribution (lots do). Then you could remove outliers by measuring the skew and kurtosis (the standardized 3rd and 4th moments), and just repeatedly remove the sample furthest from the mean until these measurements look correct: for a normal distribution, that means a skew of 0 and an excess kurtosis of 0. This algorithm is guaranteed to terminate, since each step removes one sample, and once you're down to 2 samples the skew is 0 and the excess kurtosis is -2, so there's no heavy tail left to trim. You've still got a parameter or two to tune though (how much skew and kurtosis you'll tolerate).
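
Here's a minimal sketch of that loop in Python, just to make the idea concrete. The function names and the two tolerances (max_skew and max_excess_kurtosis) are made up for illustration, and the default values are arbitrary rather than recommended. It also assumes you only care about kurtosis being too high (heavy tails), not too low, and it uses plain population moments rather than any bias-corrected estimator.

```python
def skew_and_excess_kurtosis(values):
    """Return (skew, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        # All values identical: no spread, so report a "flat" shape.
        return 0.0, 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Excess kurtosis subtracts 3 so that a normal distribution scores 0.
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def trim_outliers(values, max_skew=0.5, max_excess_kurtosis=1.0):
    """Drop the sample furthest from the mean until the shape looks normal.

    The tolerances are hypothetical tuning parameters, not standard values.
    """
    data = list(values)
    while len(data) > 2:
        skew, kurt = skew_and_excess_kurtosis(data)
        if abs(skew) <= max_skew and kurt <= max_excess_kurtosis:
            break
        mean = sum(data) / len(data)
        # Remove the single sample with the largest distance from the mean.
        data.remove(max(data, key=lambda v: abs(v - mean)))
    return data


print(trim_outliers([1, 2, 3, 4, 5, 100]))  # [1, 2, 3, 4, 5]
```

Recomputing the moments from scratch on every pass makes this quadratic in the number of samples, which is fine for small datasets but would want running sums for large ones.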
