Suppose a linear regression model for average daily humidity contains attributes for the day of the month and the temperature. Suppose further that these attributes are correlated in the data, perhaps because the temperature rose one degree on each day that data was collected. When computing the expectation function, an overly large positive estimate of the day-of-month parameter could be transparently offset by a large negative estimate of the temperature parameter, since these attributes change together. More complicated examples of this phenomenon could include any number of correlated variables and more realistic data scenarios.
Though the expectation function may hide the effects of correlated data on the parameter estimates when it is evaluated on the original data, computing the expectation for new data may expose ridiculous parameter estimates. Furthermore, ridiculous parameter estimates may interfere with proper interpretation of the regression results. Nonlinear regressions often use iterative methods to estimate $\beta$. Correlations may allow unbounded growth of the parameters, and consequently these methods may fail to converge.
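The offsetting effect described above can be sketched numerically. The following is a minimal illustration, not a fitted model: the data values, the true humidity relation, and the names `beta_sensible` and `beta_ridiculous` are all assumptions made up for this example. It shows two parameter vectors that produce identical expectations on the collected data but disagree wildly on a new observation that breaks the day/temperature pattern.

```python
import numpy as np

# Hypothetical data: ten days, with the temperature rising one degree per day.
day = np.arange(1, 11, dtype=float)
temp = 20.0 + day  # perfectly correlated with day
X = np.column_stack([np.ones_like(day), day, temp])  # columns: intercept, day, temp

# Assumed "sensible" parameters: humidity = 50 + 0.5 * temp.
beta_sensible = np.array([50.0, 0.0, 0.5])

# Because temp - day = 20 on every row, adding c * (20, 1, -1) to the
# parameters leaves every expectation on the collected data unchanged.
c = 1000.0
beta_ridiculous = beta_sensible + c * np.array([20.0, 1.0, -1.0])

fit_sensible = X @ beta_sensible
fit_ridiculous = X @ beta_ridiculous
same_fit = np.allclose(fit_sensible, fit_ridiculous)

# A new observation that breaks the pattern: day 15 at 22 degrees.
x_new = np.array([1.0, 15.0, 22.0])
pred_sensible = x_new @ beta_sensible
pred_ridiculous = x_new @ beta_ridiculous

print("identical on collected data:", same_fit)
print("new-data expectations:", pred_sensible, pred_ridiculous)
```

On the collected data the two parameter vectors are indistinguishable, so a least-squares criterion has no reason to prefer one over the other. Only the new observation reveals that the second vector is unusable.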
The undesirable symptoms of correlated attributes can be reduced by restraining the size of the parameter estimates, that is, by preferring small parameter estimates. Ridge regression is a linear regression technique which modifies the RSS computation of Equation 3.7 to include a penalty for large parameter estimates; further motivation for this penalty is given below. The penalty is usually written as $\lambda \tilde{\beta}^T \tilde{\beta}$, where $\lambda$ is the ridge coefficient and $\tilde{\beta}$ is the vector of slope parameters. This penalty is added to the RSS, and we now wish to minimize $\mathrm{RSS} + \lambda \tilde{\beta}^T \tilde{\beta}$.
Ridge regression was originally developed to overcome singularity when inverting $X^T X$ to compute the covariance matrix. In this setting the ridge coefficient $\lambda$ is a perturbation of the diagonal entries of $X^T X$ to encourage non-singularity. The covariance matrix in a ridge regression setting is $(X^T X + \lambda I)^{-1}$. Using this formulation, ridge regression may be seen as belonging to an older family of techniques called regularization.
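The diagonal perturbation can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the method of any particular library: the helper `ridge_fit`, the data, and the choice `lam=1.0` are hypothetical. The attributes and outcome are centered so that no intercept is needed and every remaining parameter is a slope parameter, which keeps the example consistent with penalizing only the slopes.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for the ridge estimate."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical data: temperature rises one degree per day, so the
# day and temperature attributes are perfectly correlated.
day = np.arange(1, 11, dtype=float)
temp = 20.0 + day
humidity = 50.0 + 0.5 * temp  # assumed relation, for illustration only

# Center attributes and outcome so the intercept drops out.
X = np.column_stack([day - day.mean(), temp - temp.mean()])
y = humidity - humidity.mean()

# Without the perturbation, X^T X is rank deficient and cannot be inverted.
gram = X.T @ X
rank_gram = np.linalg.matrix_rank(gram)

# With the perturbation, the system has a unique solution.
beta_ridge = ridge_fit(X, y, lam=1.0)

print("rank of X^T X:", rank_gram)
print("ridge estimates:", beta_ridge)
```

Because the two centered columns are identical, the penalty splits the effect evenly between them and keeps both estimates small, rather than allowing one large positive estimate to be offset by one large negative estimate.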
Another interpretation of ridge regression is available through Bayesian point estimation. In this setting the belief that $\beta$ should be small is encoded in a prior distribution. In particular, suppose the parameters $\beta$ are treated as independent Normally-distributed random variables with zero mean and known variance. Suppose further that the outcome vector is distributed as described in Section 3.1 but with known variance. The prior distribution allows us to weight the likelihood of the data by the prior probability of $\beta$. After normalization, this weighted likelihood is called the posterior distribution. The posterior distribution of $\beta$ is