contr.sum(3)
[,1] [,2]
1 1 0
2 0 1
3 -1 -1
Deviation coding of factors (also known as “sum to zero contrasts”) as implemented in R’s contr-sum
produces coefficients which represent the deviation under the given level from the mean of levels.
This interpretation is at first hard to reconcile with the contrasts contr.sum
returns:
contr.sum(3)
[,1] [,2]
1 1 0
2 0 1
3 -1 -1
They seem to code for the difference between the first
Let’s derive regressors whose coefficients represent the deviations from scratch. We start with a design matrix using dummy variables (or “treatment coding”) without a constant regressor.
<- rbind(diag(3), diag(3))) (X
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
[4,] 1 0 0
[5,] 0 1 0
[6,] 0 0 1
Estimating coefficients for these regressors amounts to calculating the mean over all values belonging to the level.
<- cbind(c(1, 6, 5, 3, 2, 11))) (y
[,1]
[1,] 1
[2,] 6
[3,] 5
[4,] 3
[5,] 2
[6,] 11
pinv(X)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.5 0.0 0.0 0.5 0.0 0.0
[2,] 0.0 0.5 0.0 0.0 0.5 0.0
[3,] 0.0 0.0 0.5 0.0 0.0 0.5
<- pinv(X) %*% y) (b
[,1]
[1,] 2
[2,] 4
[3,] 8
Once we have these coefficients, calculating deviations is simple.
<- eye(3) - ones(3) / 3) (C
[,1] [,2] [,3]
[1,] 0.6666667 -0.3333333 -0.3333333
[2,] -0.3333333 0.6666667 -0.3333333
[3,] -0.3333333 -0.3333333 0.6666667
<- C %*% b) (d
[,1]
[1,] -2.6666667
[2,] -0.6666667
[3,] 3.3333333
So how would we have to set up regressors to get deviations directly as coefficients? We have
and
and we want
and therefore for the new design matrix it needs to hold
but the result is rank deficient, too. What
Instead, we take the pseudoinverse of C with the last row removed
pinv(C[-3, ])
[,1] [,2]
[1,] 1.000000e+00 1.110223e-16
[2,] 1.665335e-16 1.000000e+00
[3,] -1.000000e+00 -1.000000e+00
which is identical to the result of contr.sum(3)
apart from rounding errors.
After adding the constant regressor, the resulting design matrix is
<- cbind(ones(6, 1), X %*% pinv(C[-3, ]))) (Xcs
[,1] [,2] [,3]
[1,] 1 1.000000e+00 1.110223e-16
[2,] 1 1.665335e-16 1.000000e+00
[3,] 1 -1.000000e+00 -1.000000e+00
[4,] 1 1.000000e+00 1.110223e-16
[5,] 1 1.665335e-16 1.000000e+00
[6,] 1 -1.000000e+00 -1.000000e+00
and it estimates the mean as well as the deviations from the mean for the first two levels:
pinv(Xcs) %*% y
[,1]
[1,] 4.6666667
[2,] -2.6666667
[3,] -0.6666667
Therefore contr.sum
indeed configures a factor for estimation of deviations from the mean, albeit for one less than the number of levels. The deviation for the last level can be obtained as the negative sum of the other deviations.
What appeared as a “reference level” is just the left-out level.
contr.sum
does not provide the ability to specify the left-out level. memisc:contr.sum
has that ability, and also creates better contrast names.
<- data.frame(
data y = c(1, 6, 5, 3, 2, 11),
f = factor(c("A", "B", "C", "A", "B", "C"))
)contrasts(data$f) <- memisc::contr.sum(levels(data$f), base ="A")
contrasts(data$f)
B C
A -1 -1
B 1 0
C 0 1
lm(
~ f,
y data = data
)
Call:
lm(formula = y ~ f, data = data)
Coefficients:
(Intercept) fB fC
4.6667 -0.6667 3.3333
There is actually no need to leave out a level. If coefficients are estimated using the pseudoinverse, the mean and the deviations can be obtained together.
<- cbind(ones(6, 1), X %*% pinv(C))) (Xd
[,1] [,2] [,3] [,4]
[1,] 1 0.6666667 -0.3333333 -0.3333333
[2,] 1 -0.3333333 0.6666667 -0.3333333
[3,] 1 -0.3333333 -0.3333333 0.6666667
[4,] 1 0.6666667 -0.3333333 -0.3333333
[5,] 1 -0.3333333 0.6666667 -0.3333333
[6,] 1 -0.3333333 -0.3333333 0.6666667
pinv(Xd) %*% y
[,1]
[1,] 4.6666667
[2,] -2.6666667
[3,] -0.6666667
[4,] 3.3333333
Unfortunately, contrasts(data$f) <- pinv(C)
doesn’t work.