What are Non-parametric tests?
Most statistical tests are optimal under assumptions such as independence, homoscedasticity, or normality. However, it is not always possible to guarantee that the data satisfy all of these assumptions. Non-parametric tests are statistical methods that do not require the normality assumption; it is replaced by a more general assumption about the distribution function (for example, that it is continuous).
Non-parametric and Distribution-free
The terms non-parametric and distribution-free are often used interchangeably, but they are not exactly synonymous. A problem is parametric or non-parametric depending on whether we allow the parent distribution of the data to depend on a finite number of parameters or keep it more general (e.g. just continuous). It is therefore a matter of how we formulate the problem. A procedure is distribution-free, on the other hand, if it does not depend on the parent distribution or its parameters. Hence, both parametric and non-parametric methods may or may not be distribution-free. However, distribution-free procedures were developed primarily for non-parametric problems, which is why the two terms are used interchangeably.
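One way to see what distribution-free means in practice is the sign statistic: under the null hypothesis, the number of observations above the true median follows a Binomial(n, 1/2) distribution, whatever the continuous parent distribution is. The sketch below is only an illustration under assumed settings; the helper name `positive_sign_distribution`, the sample size, the number of repetitions and the seed are hypothetical choices, not part of any library.

```python
import math
import random

def positive_sign_distribution(draw, median, n=10, reps=20000, seed=0):
    """Empirical distribution of the count of observations above the true median.

    `draw` takes a random.Random instance and returns one observation.
    """
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _rep in range(reps):
        above = sum(1 for _obs in range(n) if draw(rng) > median)
        counts[above] += 1
    return [c / reps for c in counts]

# Reference distribution: Binomial(n, 1/2), which involves no parent-distribution parameter
n = 10
binomial = [math.comb(n, k) / 2 ** n for k in range(n + 1)]

# Two very different continuous parents, each compared against its own true median
normal = positive_sign_distribution(lambda rng: rng.gauss(0.0, 1.0), median=0.0, n=n)
skewed = positive_sign_distribution(lambda rng: rng.expovariate(1.0), median=math.log(2), n=n)

for name, dist in [("normal", normal), ("exponential", skewed)]:
    max_gap = max(abs(e - b) for e, b in zip(dist, binomial))
    print(name, "max gap from Binomial(n, 1/2):", round(max_gap, 4))
```

Both gaps should be close to zero, up to simulation noise, even though one parent is symmetric and the other is strongly skewed.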
When to use Non-parametric tests:
1. When the data does not follow the necessary assumptions like normality.
2. When the sample size is too small, since it then becomes difficult to verify that the data follow the assumptions
3. Data is nominal or ordinal. For example, customer feedback in the form Strongly disagree, Disagree, Neutral, Agree, Strongly agree
4. The data is ranked. For example, customers rank a list of products
5. The data contains outliers
6. The measurement process has a lower or upper limit beyond which a value is only reported as Not measured or Not detected
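The point about outliers (item 5) can be illustrated with a small sketch. The numbers below are made up for illustration, and `simple_ranks` is a hypothetical helper that assumes no tied values. A single extreme value moves the mean substantially, while the median and the ranks, on which tests such as Wilcoxon signed rank, Mann-Whitney U and Kruskal-Wallis H are built, stay the same.

```python
from statistics import mean, median

def simple_ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

clean = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9]
with_outlier = [4.1, 4.8, 5.0, 5.2, 5.5, 59.0]  # last value replaced by an extreme one

print("mean:  ", mean(clean), "->", mean(with_outlier))
print("median:", median(clean), "->", median(with_outlier))
print("ranks: ", simple_ranks(clean), "->", simple_ranks(with_outlier))
```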
Advantages and disadvantages of Non-parametric tests:
Advantages:
1. They need fewer assumptions and hence can be used in a broader range of situations
2. A wide range of data types, and even small samples, can be analyzed
3. They can have more statistical power than parametric tests when the assumptions are violated in the data
Disadvantages:
1. If the assumptions are not violated, the statistical power of the test is lower than that of the analogous parametric test. In a way, using a non-parametric test when the assumptions hold makes less efficient use of the data
2. For large samples, they can be computationally expensive.
Note that if the data satisfy the assumptions (mainly the normality assumption), it is generally wise to apply parametric tests. Even in some situations where the normality assumption is not met, parametric tests can still be applied if the sample size is large enough.
Below, we introduce some of the most useful non-parametric tests, each with brief Python code.
| Nature of hypothesis | Non-parametric test | Parametric counterpart | When to use | Simple Python code* |
|---|---|---|---|---|
| One-sample location (median) | Simple sign; Wilcoxon signed rank | Student's t | Whether the population median is equal to an assumed value | from statsmodels.stats import descriptivestats; stat, p = descriptivestats.sign_test(data1, mu0=0); print("single sample sign test p-value", p)  # simple sign test against the assumed median mu0 |
| Paired-sample location (median) | Simple sign; Wilcoxon signed rank | Paired t | Whether the two paired samples have the same median, i.e. the median of the paired differences is zero | from scipy.stats import wilcoxon; stat, p = wilcoxon(data1, data2); print("paired sample Wilcoxon signed rank test p-value", p)  # paired-sample Wilcoxon signed rank test |
| Two independent samples location (median) | Wilcoxon rank sum; Mann-Whitney U | Two-sample t (Fisher's t) | Whether the medians of two independent samples are equal | from scipy.stats import mannwhitneyu; stat, p = mannwhitneyu(data1, data2, alternative="two-sided"); print("two sample Mann-Whitney U test p-value", p)  # two independent sample Mann-Whitney U test |
| General two independent samples | Wald-Wolfowitz run | – | Whether two independent samples have been drawn from the same distribution | from statsmodels.sandbox.stats.runs import runstest_2samp; stat, p = runstest_2samp(data1, data2); print("two sample Wald-Wolfowitz run test p-value", p)  # Wald-Wolfowitz run test |
| Multiple samples location | Kruskal-Wallis H | ANOVA | Whether more than two samples have been drawn from the same distribution | from scipy.stats import kruskal; stat, p = kruskal(data1, data2, data3); print("multiple sample Kruskal-Wallis H test p-value", p)  # Kruskal-Wallis H test for multiple samples |
*In most cases, the Python functions above perform a two-tailed test by default.
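To show what the simplest test in the table computes, here is a minimal from-scratch sketch of a two-sided exact sign test using only the standard library. It is an illustration, not the statsmodels implementation: the function name `exact_sign_test`, the sample values and the assumed median of 12.0 are hypothetical. Observations equal to the assumed median are dropped, and the remaining signs are compared with a Binomial(n, 1/2) distribution.

```python
from math import comb

def exact_sign_test(sample, mu0=0.0):
    """Two-sided exact sign test for H0: population median == mu0.

    Returns (statistic, p_value), where statistic = (positives - negatives) / 2.
    """
    pos = sum(1 for x in sample if x > mu0)
    neg = sum(1 for x in sample if x < mu0)
    n = pos + neg  # observations equal to mu0 carry no sign and are dropped
    if n == 0:
        return 0.0, 1.0
    k = min(pos, neg)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test and capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return (pos - neg) / 2, min(1.0, 2 * tail)

# Hypothetical sample: is the population median equal to 12.0?
data1 = [12.1, 14.3, 9.8, 15.2, 13.7, 16.0, 14.9, 15.5, 13.2, 17.1]
stat, p = exact_sign_test(data1, mu0=12.0)
print("sign test statistic:", stat)
print("sign test p-value:", p)
```

With 9 of the 10 values above 12.0, the two-sided p-value is 22/1024, about 0.0215, so the assumed median would be rejected at the 5% level.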
Conclusion:
Statistical tests are powerful tools for learning from and comparing samples. This article introduced the concept of non-parametric tests, when to use them, their advantages and disadvantages, and several non-parametric tests along with their Python code. Used wisely, these concepts help in analyzing a wide range of samples with minimal assumptions.