14.3 Mann-Whitney U tests
14.3.1 Introduction
A Mann-Whitney U test (also called a Wilcoxon rank-sum test) is a non-parametric counterpart to the independent-samples t-test. In other words, it applies to situations where you are comparing two independent groups, and where for whatever reason the assumptions of an independent t-test are severely violated.
Note that many statistics webpages erroneously call the Mann-Whitney U a test of medians; this is not necessarily true (and even describing it as a test of distributions, as below, is a little strained). The test simply operates on the ranks of the data.
14.3.2 Hypotheses
- \(H_0\): The probability distributions of the two groups are the same (i.e. they derive from the same distribution).
- \(H_1\): The probability distributions of the two groups are not the same (i.e. they derive from different distributions).
The test statistic is the U statistic. U ranges from 0 to \(n_1 n_2\) (the sample sizes of the two groups multiplied together); a value at either extreme implies complete separation between the two groups, while values near the middle of that range imply substantial overlap.
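For reference, one common way to write the statistic: if \(R_1\) is the sum of the ranks of group 1 when all observations are ranked together, then

\[ U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}, \qquad U_2 = n_1 n_2 - U_1, \]

and the reported U is often the smaller of the two. R’s wilcox.test() reports this quantity, calculated for the first group, under the label W.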
14.3.3 Example
The dataset for this page and the next relates to young men’s wages in 1980 and 1987 across the United States. The original study was interested in the effects of union bargaining/membership on wages.
The following variables are in the dataset:
- nr: Participant ID
- year: year of measurement (1980 or 1987)
- school: Years of schooling
- exper: Years of work experience, calculated as school - 6
- union: Was their wage set by collective bargaining? (two levels: yes, no)
- ethn: Participant ethnicity (three levels: black, hisp, other)
- married: Marital status (two levels: yes, no)
- health: Does the participant have a health problem? (two levels: yes, no)
- wage: Hourly wage, log-transformed
- industry, occupation, residence: Demographic and descriptive variables
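The column specification printed below comes from loading the data with readr. As a rough sketch of that step (the file name wages.csv and the object name wages are assumptions for illustration, not taken from the original):

```r
library(tidyverse)

# Read in the wages data; read_csv() prints the column specification shown below
wages <- read_csv("wages.csv")
```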
## Rows: 1090 Columns: 13
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (7): union, ethn, married, health, industry, occupation, residence
## dbl (6): ...1, nr, year, school, exper, wage
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Consider the following question: In 1980, were wages higher for union members than non-union members?
Let’s take a look at the data in R. First, let’s filter our dataset so that we only have cases from 1980.
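A minimal sketch of that filtering step with dplyr, assuming the full dataset is stored in an object called wages (an assumed name):

```r
library(dplyr)

# Keep only the 1980 observations
wages_1980 <- wages %>%
  filter(year == 1980)
```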
Pretend that we run our assumption checks on the wage data and obtain the following:
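The exact code isn’t shown here, but the checks might look something like the sketch below: Shapiro-Wilk tests for normality within each group, and Levene’s test (from the car package) for homogeneity of variance. The object name wages_1980 is the assumed name of the filtered data from above; the Levene call is what produces the warning shown below.

```r
library(car)

# Normality of (log) wages within each union group
shapiro.test(wages_1980$wage[wages_1980$union == "yes"])
shapiro.test(wages_1980$wage[wages_1980$union == "no"])

# Homogeneity of variance between union and non-union groups
leveneTest(wage ~ union, data = wages_1980)
```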
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
Both assumptions have been violated. Now, pretend that we think these violations are bad enough that even a Welch test wouldn’t be appropriate. In this instance, we may turn to a Mann-Whitney U test.
14.3.4 Output
To run a Mann-Whitney U test, we use the wilcox.test() function in R. The wilcox.test() function behaves just like the regular t.test() function for both independent- and paired-samples t-tests, down to the same notation. So, we can use the same formula notation as we have for an independent-samples t-test in the past:
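A sketch of the call, assuming the filtered 1980 data are stored in wages_1980 (an assumed name):

```r
# Mann-Whitney U test of wage by union membership,
# using the same formula interface as t.test()
wilcox.test(wage ~ union, data = wages_1980)
```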
##
## Wilcoxon rank sum test with continuity correction
##
## data: wage by union
## W = 19767, p-value = 2.898e-07
## alternative hypothesis: true location shift is not equal to 0
The output above shows that the p-value is significant, and therefore these two samples (union vs non-union) do not appear to come from the same underlying distribution (Mann-Whitney U = 19767, p < .001). We can then use descriptives as per normal to figure out where the difference lies (the median and mean wages for union members are higher than for non-members).
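For instance, a quick way to get those descriptives with dplyr (again assuming the filtered data are called wages_1980) might be:

```r
library(dplyr)

# Median and mean (log) wage for union vs non-union members
wages_1980 %>%
  group_by(union) %>%
  summarise(
    median_wage = median(wage),
    mean_wage   = mean(wage),
    n           = n()
  )
```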
We also want to calculate our effect size for this test, called the rank-biserial correlation. We won’t worry too much about the maths here, but we can broadly interpret it along similar lines to Pearson’s r (weak to medium in this instance). To do this, we can use the rank_biserial() function in the effectsize package, which works like its cohens_d() counterpart:
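A sketch of that call, using the same assumed wages_1980 object as above:

```r
library(effectsize)

# Rank-biserial correlation as the effect size for the Mann-Whitney U test
rank_biserial(wage ~ union, data = wages_1980)
```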
Something to note, though, is that this result is a little vaguer to interpret than a standard t-test. With a regular t-test, we test the difference between two group means, and so we can directly compare means when interpreting the result. Here, however, we are testing differences in ranks, which doesn’t have a clean interpretation beyond there simply being a difference (of sorts) between the groups.