Too Many Levels

Problem: An important control variable has too many levels (i.e., most levels are too rare to be interesting and/or have too few cases for their coefficients to be reliably estimated).

For an example and some ideas on how to solve it, see https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels

Solution 1: Treat the variable as a random effect

This should work well mathematically, and there are tools in R that can be readily used. Having many levels does not imply that they were drawn from an infinite population of levels, which is what a random effect formally assumes, but so what?

package: lme4
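
A minimal sketch of what a random intercept does to the level estimates, in plain Python so that it has no dependencies. It is not what lme4 computes internally: the variance ratio (var_ratio, a name made up here) is assumed known, whereas a mixed-model fit estimates it from the data, and the plain overall mean is used as the centre. In R the corresponding model would be fitted with something like lmer(y ~ x + (1 | level), data = d).

```python
from collections import defaultdict

def shrunken_level_means(levels, y, var_ratio=4.0):
    """Partial-pooling estimate of each level's mean.

    var_ratio = residual variance / between-level variance. It is assumed
    known here; a mixed-model fit would estimate it from the data.
    """
    grand_mean = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for lev, value in zip(levels, y):
        sums[lev] += value
        counts[lev] += 1
    shrunk = {}
    for lev, n in counts.items():
        # Weight is close to 0 for rare levels and close to 1 for common ones.
        weight = n / (n + var_ratio)
        shrunk[lev] = grand_mean + weight * (sums[lev] / n - grand_mean)
    return shrunk

# Toy data: two common levels and one level with a single extreme case.
levels = ["a"] * 8 + ["b"] * 8 + ["rare"]
y = [1.0] * 8 + [3.0] * 8 + [10.0]
estimates = shrunken_level_means(levels, y)
```

The rare level is pulled strongly toward the overall mean, while the common levels stay close to their own means. This is why rare levels stop being a problem under this solution.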

Solution 2: Collapse levels into groups with similar effect size on Y

If the goal is only to get optimal power, collapsing levels into groups that share a similar effect size (coefficient) is enough. The difficulty is finding an R implementation of this algorithm, which is described in

Gerhard Tutz and Jan Gertheiss, "Regularized regression for categorical data"
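
To show what "collapsing by effect size" means, here is a crude stand-in in plain Python. It is not the penalized method from the paper: it simply sorts the levels by their raw mean of Y and starts a new group whenever the gap to the previous level exceeds a tolerance. The function name and the tol parameter are made up for illustration, and raw means ignore any other covariates in the model.

```python
from collections import defaultdict

def collapse_by_mean(levels, y, tol=0.5):
    """Greedy heuristic: sort levels by their mean of y and start a new
    group whenever the gap to the previous level's mean exceeds tol."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lev, value in zip(levels, y):
        sums[lev] += value
        counts[lev] += 1
    ordered = sorted(counts, key=lambda lev: sums[lev] / counts[lev])
    group_of, group_id, previous = {}, 0, None
    for lev in ordered:
        mean = sums[lev] / counts[lev]
        if previous is not None and mean - previous > tol:
            group_id += 1
        group_of[lev] = group_id
        previous = mean
    return group_of

# Toy data: a and b have similar means, as do c and d.
levels = ["a", "a", "b", "b", "c", "c", "d"]
y = [1.0, 1.2, 1.1, 1.3, 5.0, 5.2, 5.1]
groups = collapse_by_mean(levels, y)
```

Because neighbouring levels are chained, a long run of small gaps can end up in one wide group, and the grouping is chosen using the same Y that the model is later fitted to. Both are reasons to prefer a principled penalty where an implementation is available.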

Solution 3: Create one continuous variable based on the effect size on Y
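
One common way to do this is to replace each level by a number derived from Y. The sketch below, in plain Python, uses the level's mean of Y as a stand-in for its effect size, which is an assumption of this example. It is computed leave-one-out so that a case's own outcome does not leak into its own predictor value. The function name is made up for illustration.

```python
from collections import defaultdict

def leave_one_out_encode(levels, y):
    """Replace each case's level by the mean of y over the other cases
    with the same level; singleton levels fall back to the overall mean."""
    overall = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for lev, value in zip(levels, y):
        sums[lev] += value
        counts[lev] += 1
    encoded = []
    for lev, value in zip(levels, y):
        if counts[lev] > 1:
            encoded.append((sums[lev] - value) / (counts[lev] - 1))
        else:
            encoded.append(overall)
    return encoded

# Toy data: the single "rare" case gets the overall mean of y.
levels = ["a", "a", "a", "b", "b", "rare"]
y = [1.0, 2.0, 3.0, 10.0, 12.0, 7.0]
x_numeric = leave_one_out_encode(levels, y)
```

The resulting x_numeric enters the model as a single numeric predictor, costing one degree of freedom instead of one per level.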

Solution 4: Fit the model to the data with an algorithm that natively supports the group lasso

packages: grplasso, grpreg

grplasso: "Fitting User-Specified Models with Group Lasso Penalty"
grpreg: "Regularization Paths for Regression Models with Grouped Covariates"
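
The core of the group lasso is a shrinkage step applied to all coefficients of one group at once. The sketch below shows only that step in plain Python, not a full fitting algorithm and not the code of either package; the function name is made up for illustration. In R, a fit would look something like grpreg(X, y, group, penalty = "grLasso"), with group marking which dummy columns belong to the same factor.

```python
import math

def group_soft_threshold(coefs, threshold):
    """Shrink a group's coefficient vector toward zero by its Euclidean
    norm; the whole group becomes exactly zero if the norm <= threshold."""
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= threshold:
        return [0.0] * len(coefs)
    scale = 1.0 - threshold / norm
    return [scale * c for c in coefs]

# Dummy coefficients of a weak factor are dropped together ...
weak_factor = group_soft_threshold([0.3, -0.4], threshold=1.0)
# ... while those of a strong factor are kept and shrunk proportionally.
strong_factor = group_soft_threshold([3.0, -4.0], threshold=1.0)
```

Note that this keeps or drops a factor as a whole. It does not by itself merge levels within a factor, so it addresses the "too many coefficients" side of the problem rather than the collapsing of Solution 2.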

Last modified: March 31, 2021