

3.2 Ridge Regression

Suppose a linear regression model for average daily humidity contains attributes for the day-of-month and the temperature. Suppose further that in the data $\mathbf{X}$ these attributes are correlated, perhaps because the temperature rose one degree each day data was collected. When computing the expectation function $f(\mathbf{x}_i, \hat{\mathbf{\beta}}) = \hat{\mathbf{\beta}}^T \mathbf{x}_i$, overly large positive estimates of $\hat{\beta}_{\mbox{temperature}}$ could be transparently offset by large negative values of $\hat{\beta}_{\mbox{day-of-month}}$, since these attributes change together. More complicated examples of this phenomenon could involve any number of correlated variables and more realistic data scenarios.
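
This offsetting effect is easy to reproduce numerically. The sketch below uses invented data matching the humidity example (the numbers and the offset $c$ are purely illustrative): because temperature rises one degree per day, adding a constant to the temperature coefficient and subtracting it from the day-of-month coefficient (with a matching change to the intercept) leaves the fitted values unchanged.

    import numpy as np

    # Invented humidity data: temperature rises one degree each day of
    # collection, so day-of-month and temperature are perfectly correlated.
    day = np.arange(1.0, 31.0)                      # day-of-month, 1..30
    temperature = 20.0 + day                        # rises one degree per day
    X = np.column_stack([np.ones_like(day), day, temperature])

    beta = np.array([5.0, 0.1, 0.3])                # (intercept, day, temperature)
    c = 1000.0                                      # arbitrarily large offset
    # Raise the temperature coefficient by c, compensate via day and intercept:
    beta_shifted = beta + np.array([-20.0 * c, -c, c])

    # Two very different parameter vectors produce identical expectations.
    print(np.allclose(X @ beta, X @ beta_shifted))  # True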

Though the expectation function may hide the effects of correlated attributes on the parameter estimate for the training data, computing the expectation of new data may expose ridiculous parameter estimates. Furthermore, ridiculous parameter estimates may interfere with proper interpretation of the regression results. Nonlinear regressions often use iterative methods to estimate $\hat{\mathbf{\beta}}$; correlations may allow unbounded growth of parameters, and consequently these methods may fail to converge.

The undesirable symptoms of correlated attributes can be reduced by restraining the size of the parameter estimates, that is, by preferring small parameter estimates. Ridge regression is a linear regression technique which modifies the RSS computation of Equation 3.7 to penalize large parameter estimates, for reasons explained below. This penalty is usually written as $\lambda \tilde{\mathbf{\beta}}^T \tilde{\mathbf{\beta}}$, where $\tilde{\mathbf{\beta}}$ is the vector of slope parameters $\beta_1, \ldots, \beta_M$. The penalty is added to the RSS, and we now wish to minimize

$\displaystyle \mbox{RSS}_{\mbox{ridge}} = \mbox{RSS} + \lambda \tilde{\mathbf{\beta}}^T \tilde{\mathbf{\beta}}$ (3.14)

Unreasonably large estimates for $\tilde{\mathbf{\beta}}$ will appear to have a large sum-of-squares error and be rejected. Consider two datasets which differ only in the location of the outcome variable, for example $\mathbf{X}, \mathbf{y}$ and $\mathbf{X}, (\mathbf{y} + 10^6)$. Because we do not penalize the offset parameter $\beta_0$, the offset absorbs the shift and the same slope parameters $\tilde{\mathbf{\beta}}$ will minimize Equation 3.14 for both datasets.
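
A minimal sketch of Equation 3.14, assuming the first column of $\mathbf{X}$ holds the constant term so that $\beta_0$ is left out of the penalty (the data and parameter values are the invented humidity example again): the two parameter vectors fit the data equally well, but the penalty term rejects the unreasonably large one.

    import numpy as np

    def rss_ridge(X, y, beta, lam):
        """Equation 3.14: residual sum of squares plus a penalty on the slopes.

        Assumes X's first column is the constant term, so beta[0] is the
        unpenalized offset and beta[1:] is the slope vector (beta-tilde).
        """
        residuals = y - X @ beta
        slopes = beta[1:]
        return residuals @ residuals + lam * (slopes @ slopes)

    day = np.arange(1.0, 31.0)
    temperature = 20.0 + day
    X = np.column_stack([np.ones_like(day), day, temperature])
    y = 5.0 + 0.4 * day + np.random.default_rng(0).normal(0.0, 0.5, day.size)

    beta_small = np.array([5.0, 0.1, 0.3])
    beta_large = beta_small + np.array([-20000.0, -1000.0, 1000.0])  # same fit

    print(rss_ridge(X, y, beta_small, 1.0))  # ordinary RSS plus a small penalty
    print(rss_ridge(X, y, beta_large, 1.0))  # dominated by the penalty term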

Ridge regression was originally developed to overcome singularity when inverting $\mathbf{X}^T \mathbf{X}$ to compute the covariance matrix. In this setting the ridge coefficient $\lambda$ is a perturbation of the diagonal entries of $\mathbf{X}^T \mathbf{X}$ that encourages non-singularity [10]. The covariance matrix in a ridge regression setting is $(\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1}$. Using this formulation, ridge regression may be seen as belonging to an older family of techniques called regularization [34].
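
The effect of the diagonal perturbation can be seen with the closed-form estimator $\hat{\mathbf{\beta}} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$. The sketch below is illustrative only; it reuses the perfectly correlated humidity data, and it zeroes the diagonal entry for the constant column so that the offset remains unpenalized, as discussed above.

    import numpy as np

    day = np.arange(1.0, 31.0)
    temperature = 20.0 + day                  # exact collinearity with day
    X = np.column_stack([np.ones_like(day), day, temperature])
    y = 5.0 + 0.4 * day + np.random.default_rng(0).normal(0.0, 0.5, day.size)

    XtX = X.T @ X
    print(np.linalg.matrix_rank(XtX))         # 2 of 3: X^T X is singular

    lam = 1.0
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                       # leave the offset unpenalized
    beta_ridge = np.linalg.solve(XtX + penalty, X.T @ y)
    print(beta_ridge)                         # finite, modest slope estimates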

Another interpretation of ridge regression is available through Bayesian point estimation. In this setting the belief that $\mathbf{\beta}$ should be small is encoded in a prior distribution. In particular, suppose the parameters $\beta_1, \ldots, \beta_M$ are treated as independent Normally-distributed random variables with zero mean and known variance $\tau^2$. Suppose further that the outcome vector $\mathbf{y}$ is distributed as described in Section 3.1, but with known variance $\sigma^2$. The prior distribution allows us to weight the likelihood $\mathbb{L}(\mathbf{y}, \mathbf{X}, \hat{\mathbf{\beta}}, \sigma^2)$ of the data by the probability of $\hat{\mathbf{\beta}}$. After normalization, this weighted likelihood is called the posterior distribution. The posterior distribution of $\mathbf{\beta}$ is

$\displaystyle f_{\mbox{posterior}}(\mathbf{\beta} \,\vert\, \mathbf{y}, \mathbf{X}, \sigma^2, \tau^2) \propto \exp\left( -\frac{(\mathbf{y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{y} - \mathbf{X}\mathbf{\beta})}{2\sigma^2} - \frac{\tilde{\mathbf{\beta}}^T \tilde{\mathbf{\beta}}}{2\tau^2} \right)$ (3.15)
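
Taking the logarithm of Equation 3.15 makes the connection to Equation 3.14 explicit:

$\displaystyle \log f_{\mbox{posterior}}(\mathbf{\beta} \,\vert\, \mathbf{y}, \mathbf{X}, \sigma^2, \tau^2) = -\frac{1}{2\sigma^2} \left( \mbox{RSS} + \frac{\sigma^2}{\tau^2}\, \tilde{\mathbf{\beta}}^T \tilde{\mathbf{\beta}} \right) + \mbox{const} = -\frac{1}{2\sigma^2}\, \mbox{RSS}_{\mbox{ridge}} + \mbox{const}$

when $\lambda = \sigma^2 / \tau^2$.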

The Bayesian point estimator for $\mathbf{\beta}$ is the $\hat{\mathbf{\beta}}$ with the highest probability under the posterior distribution. Since maximizing the log posterior is equivalent to minimizing $\mbox{RSS}_{\mbox{ridge}}$ with $\lambda = \sigma^2 / \tau^2$, the Bayesian point estimator for $\mathbf{\beta}$ is the same as the least squares estimator given the loss function $\mbox{RSS}_{\mbox{ridge}}$ and an appropriately chosen prior variance $\tau^2$ [43,10].
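
As an illustrative numerical check (the simulated data, the known values of $\sigma$ and $\tau$, and the use of SciPy's general-purpose optimizer are assumptions for the sketch, not part of the development above), maximizing the posterior directly recovers the ridge solution with $\lambda = \sigma^2 / \tau^2$:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, sigma, tau = 50, 1.0, 2.0              # noise and prior s.d. assumed known
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, sigma, n)

    def neg_log_posterior(beta):
        # Normal likelihood plus independent N(0, tau^2) priors on the slopes;
        # the offset beta[0] gets no penalty, matching the text.
        resid = y - X @ beta
        return resid @ resid / (2 * sigma**2) + beta[1:] @ beta[1:] / (2 * tau**2)

    beta_map = minimize(neg_log_posterior, np.zeros(3)).x

    lam = sigma**2 / tau**2                   # lambda = sigma^2 / tau^2
    penalty = lam * np.eye(3)
    penalty[0, 0] = 0.0
    beta_ridge = np.linalg.solve(X.T @ X + penalty, X.T @ y)

    print(np.allclose(beta_map, beta_ridge, atol=1e-5))  # True, up to optimizer tolerance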

