3.2 Ridge Regression

Suppose a linear regression model for average daily humidity contains attributes for the day-of-month and the temperature. Suppose further that in the data these attributes are correlated, perhaps because the temperature rose one degree each day data was collected. When computing the expectation function $f(x;\beta) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, overly large positive estimates of $\beta_1$ could be transparently offset by large negative values for $\beta_2$, since these attributes change together. More complicated examples of this phenomenon could include any number of correlated variables and more realistic data scenarios.
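The offsetting effect can be made concrete with a small numerical sketch (the data and names below are invented for illustration, not taken from the text): when day-of-month and temperature move in lockstep, wildly different parameter vectors fit the data equally well.

```python
import numpy as np

# Day-of-month and temperature are perfectly correlated here: the
# temperature rose one degree each day (values are illustrative).
day = np.arange(1.0, 31.0)
temp = day.copy()
X = np.column_stack([np.ones_like(day), day, temp])
y = 50 + 0.5 * day                      # noise-free outcome, overall slope 0.5

def rss(beta):
    """Residual sum of squares for a candidate parameter vector."""
    r = y - X @ beta
    return float(r @ r)

modest = np.array([50.0, 0.25, 0.25])   # splits the slope evenly
wild = np.array([50.0, 100.5, -100.0])  # huge offsetting slope estimates
print(rss(modest), rss(wild))           # both fit the data perfectly
```

Because the two attributes are identical, only the sum of the two slope parameters is determined by the data; the individual slopes can grow without bound as long as they offset each other.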

Though the expectation function may hide the effects of correlated data on the parameter estimate, computing the expectation of new data may expose ridiculous parameter estimates. Furthermore, ridiculous parameter estimates may interfere with proper interpretation of the regression results. Nonlinear regressions often use iterative methods to estimate $\beta$. Correlations may allow unbounded growth of parameters, and consequently these methods may fail to converge.

The undesirable symptoms of correlated attributes can be reduced by restraining the size of the parameter estimates, that is, by preferring small parameter estimates. Ridge regression is a linear regression technique which modifies the RSS computation of Equation 3.7 to include a penalty for large parameter estimates, for reasons explained below. This penalty is usually written as $\lambda \sum_{j=1}^{p} \beta_j^2$, where $\beta = (\beta_1, \ldots, \beta_p)$ is the vector of slope parameters. This penalty is added to the RSS, and we now wish to minimize

$$\mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \qquad (3.14)$$

Unreasonably large estimates for $\beta$ will appear to have a large sum-of-squares error and be rejected. Consider two datasets which differ only by a constant shift of the outcome variable, for example outcomes $y_i$ and $y_i + c$. Because we do not penalize the offset parameter $\beta_0$, the same slope parameters will minimize Equation 3.14 for both datasets.
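A numpy sketch of the offset argument (the function and variable names are my own, not from the text): because the offset $\beta_0$ is not penalized, one may center the attributes and outcomes, solve for the penalized slopes, and recover the offset afterwards. Adding a constant to every outcome then changes only the offset estimate, not the slopes.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge with an unpenalized offset: center X and y, penalize slopes only."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    slopes = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    offset = y_mean - x_mean @ slopes
    return offset, slopes

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 3 + X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=40)

b0_a, slopes_a = ridge_fit(X, y, lam=5.0)
b0_b, slopes_b = ridge_fit(X, y + 100.0, lam=5.0)  # shift every outcome by 100
print(np.allclose(slopes_a, slopes_b))             # slopes are unchanged
```

The offset estimate absorbs the shift of 100 while the slope estimates, and hence the penalty, are identical for both datasets.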

Ridge regression was originally developed to overcome singularity when inverting $X^{\top}X$ to compute the covariance matrix. In this setting the ridge coefficient $\lambda$ is a perturbation of the diagonal entries of $X^{\top}X$ to encourage non-singularity [10]. The covariance matrix in a ridge regression setting is $(X^{\top}X + \lambda I)^{-1}$. Using this formulation, Ridge regression may be seen as belonging to an older family of techniques called *regularization* [34].
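A brief numerical illustration of the diagonal perturbation (synthetic data; all names here are assumptions for the sketch): adding $\lambda I$ to a nearly singular $X^{\top}X$ makes it comfortably invertible.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=n)
# Two nearly identical columns make X^T X numerically singular.
X = np.column_stack([x, x + rng.normal(scale=1e-8, size=n)])
y = 2 * x                                 # illustrative outcome

lam = 0.1
A = X.T @ X
A_ridge = A + lam * np.eye(2)             # perturb the diagonal entries
print(np.linalg.cond(A), np.linalg.cond(A_ridge))

# The perturbed system can now be solved stably.
beta_ridge = np.linalg.solve(A_ridge, X.T @ y)
```

The condition number of $X^{\top}X$ is astronomical, while that of the perturbed matrix is modest, which is exactly the non-singularity the ridge coefficient is meant to encourage.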

Another interpretation of Ridge regression is available through Bayesian point estimation. In this setting the belief that $\beta$ should be small is coded into a *prior distribution*. In particular, suppose the parameters $\beta_1, \ldots, \beta_p$ are treated as independent Normally-distributed random variables with zero mean and known variance $\tau^2$. Suppose further that the outcome vector $y$ is distributed as described in Section 3.1 but with known variance $\sigma^2$. The prior distribution allows us to weight the likelihood $\Pr(y \mid \beta)$ of the data by the probability of $\beta$. After normalization, this weighted likelihood is called the *posterior distribution*. The posterior distribution of $\beta$ is

$$\Pr(\beta \mid y) \propto \exp\!\left(-\frac{1}{2\sigma^2}\,\mathrm{RSS}(\beta)\right) \exp\!\left(-\frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2\right) \qquad (3.15)$$

The Bayesian point estimator for $\beta$ is the $\beta$ with highest probability under the posterior distribution. Since maximizing Equation 3.15 is equivalent to minimizing $\mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2$ with $\lambda = \sigma^2/\tau^2$, the Bayesian point estimator for $\beta$ is the same as the ridge estimator of Equation 3.14 for an appropriately chosen prior variance $\tau^2$ [43,10].
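This equivalence is easy to check numerically. The sketch below uses invented data ($\sigma$, $\tau$, and the design are assumptions, and the offset term is omitted for simplicity): the gradient of the negative log posterior vanishes at the ridge solution computed with $\lambda = \sigma^2/\tau^2$, so the posterior mode and the ridge estimate coincide.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.5])
sigma, tau = 1.0, 2.0
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Ridge estimate with lam = sigma^2 / tau^2.
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Negative log posterior, up to additive constants.
def neg_log_post(b):
    r = y - X @ b
    return (r @ r) / (2 * sigma**2) + (b @ b) / (2 * tau**2)

# Its gradient vanishes at the ridge solution, so that point is the mode.
grad = -(X.T @ (y - X @ beta_ridge)) / sigma**2 + beta_ridge / tau**2
print(np.max(np.abs(grad)))  # numerically zero
```

Since the negative log posterior is convex, the vanishing gradient confirms that the ridge estimate maximizes the posterior, matching the claim above.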