
# 3.1 Normal Model

Regression is a collection of statistical function-fitting techniques. These techniques are classified according to the form of the function being fit to the data. In linear regression a linear function is used to describe the relation between the independent variable or vector $x$ and a dependent variable $y$. This function has the form

$$f(x; \beta) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \qquad (3.1)$$

where $\beta$ is the vector of unknown parameters to be estimated from the data. If we assume that $x$ has a zeroth element $x_0 = 1$, we can include the constant term $\beta_0$ in the parameter vector and write this function more conveniently as

$$f(x; \beta) = x'\beta \qquad (3.2)$$

Because $f$ is linear, the parameters $\beta_1, \dots, \beta_p$ are sometimes called the slope parameters, while $\beta_0$ is the offset at the origin. Due to random processes such as measurement errors, we assume that $y$ does not correspond perfectly to $f(x; \beta)$. For this reason a statistical model is created which accounts for randomness. For linear regression we will use the linear model

$$y = x'\beta + \varepsilon \qquad (3.3)$$

where the error term $\varepsilon$ is a Normally distributed random variable with zero mean and unknown variance $\sigma^2$. Therefore the expected value of this linear model is
$$\mathrm{E}[y] = \mathrm{E}[x'\beta] + \mathrm{E}[\varepsilon] \qquad (3.4)$$
$$= x'\beta \qquad (3.5)$$

and for this reason $x'\beta$ is called the expectation function. We assume that $\sigma^2$ is constant and hence does not depend on $x$. We also assume that if multiple experiments are conducted, then the errors $\varepsilon_i$ are independent.
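The model above can be simulated directly. The sketch below draws data from Equation 3.3 with an invented design matrix and parameter vector (all values here are illustrative, not from the text) and checks that the average outcome tracks the expectation function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: n = 5000 experiments, two variables plus a
# constant column x_0 = 1 carrying the offset beta_0.
n = 5000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])   # illustrative "true" parameter vector
sigma = 0.3                          # constant error standard deviation

# Linear model of Equation 3.3: y = X beta + epsilon.
eps = rng.normal(0.0, sigma, size=n)
y = X @ beta + eps

# The expectation function is X beta, so the mean of y should be close
# to the mean of X beta (the errors average out near zero).
print(np.allclose(y.mean(), (X @ beta).mean(), atol=0.05))
```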

Suppose $X$ is an $n \times p$ matrix whose $n$ rows represent experiments, each described by $p$ variables. Let $y$ be an $n \times 1$ vector representing the outcome of each experiment in $X$. We wish to estimate values for the parameters $\beta$ such that the linear model of Equation 3.3 is, hopefully, a useful summarization of the data $(X, y)$. One common method of parameter estimation for linear regression is least squares. This method finds a vector $\hat\beta$ which minimizes the residual sum of squares (RSS), defined as

$$\mathrm{RSS}(\tilde\beta) = \sum_{i=1}^{n} (y_i - x_i'\tilde\beta)^2 \qquad (3.6)$$
$$= (y - X\tilde\beta)'(y - X\tilde\beta) \qquad (3.7)$$

where $x_i$ is the $i$th row of $X$. Note that $\beta$ is the "true" parameter vector, while $\tilde\beta$ is an informed guess for $\beta$. Throughout this thesis a variable with a circumflex, such as $\hat\beta$, is an estimate of some quantity we cannot know, like $\beta$. The parameter vector is sometimes called a weight vector, since the elements of $\beta$ are multiplicative weights for the columns of $X$. For linear regression the loss function is the RSS; regression techniques with different error structure may have other loss functions.
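As a minimal sketch of the RSS as a loss function, the snippet below evaluates it on simulated data (design, parameters, and noise level are all invented for illustration) and confirms that the true parameter vector scores better than a perturbed guess:

```python
import numpy as np

def rss(beta_tilde, X, y):
    """Residual sum of squares for a candidate parameter vector."""
    resid = y - X @ beta_tilde
    return float(resid @ resid)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta = np.array([1.0, 2.0, -0.5])   # illustrative "true" parameters
y = X @ beta + rng.normal(0.0, 0.1, size=50)

# The true beta should yield a smaller RSS than a badly perturbed guess.
print(rss(beta, X, y) < rss(beta + 0.5, X, y))
```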

To minimize the RSS we compute the partial derivatives of $\mathrm{RSS}(\tilde\beta)$ with respect to each $\tilde\beta_j$, for $j = 0, \dots, p$, and set the partials equal to zero. The result is the score equations for linear regression, $X'(y - X\hat\beta) = 0$, from which we compute the least squares estimator

$$\hat\beta = (X'X)^{-1} X'y \qquad (3.8)$$

The expected value of $\hat\beta$ is $\beta$, and hence $\hat\beta$ is unbiased. The covariance matrix of $\hat\beta$, $\mathrm{cov}(\hat\beta)$, is $\sigma^2 (X'X)^{-1}$. The variances along the diagonal are the smallest possible variances for any unbiased estimate of $\beta$. These properties follow from our assumption that the errors were independent and Normally distributed with zero mean and constant variance $\sigma^2$ [30].
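Equation 3.8 can be computed directly on simulated data. In the sketch below (all data invented), the normal equations are solved with a linear solve rather than an explicit inverse, which is numerically safer, and the result is cross-checked against NumPy's built-in least squares routine:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])   # illustrative "true" parameters
sigma = 0.1
y = X @ beta + rng.normal(0.0, sigma, size=n)

# Least squares estimator of Equation 3.8: solve X'X beta_hat = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Estimated covariance matrix sigma^2 (X'X)^{-1} of beta_hat.
cov_hat = sigma**2 * np.linalg.inv(X.T @ X)

# Agrees with NumPy's least squares solver.
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))
```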

Another method of estimating $\beta$ is maximum likelihood estimation. In this method we evaluate the probability of encountering the outcomes $y$ for our data $X$ under the linear model of Equation 3.3 when $\beta = \tilde\beta$. We choose as our estimate $\hat\beta$ the value of $\tilde\beta$ which maximizes the likelihood function

$$L(\tilde\beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - x_i'\tilde\beta)^2}{2\sigma^2} \right) \qquad (3.9)$$
$$= (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{(y - X\tilde\beta)'(y - X\tilde\beta)}{2\sigma^2} \right) \qquad (3.10)$$

over $\tilde\beta$. We are interested in the maximization of $L$ and not its actual value, which allows us to work with the more convenient log-transformation of the likelihood. Since we are maximizing over $\tilde\beta$, we can also drop factors and terms which are constant with respect to $\tilde\beta$. Discarding these, the log-likelihood function is
$$\ell(\tilde\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{(y - X\tilde\beta)'(y - X\tilde\beta)}{2\sigma^2} \qquad (3.11)$$
$$\propto -(y - X\tilde\beta)'(y - X\tilde\beta) \qquad (3.12)$$

To maximize the log-likelihood function we need to minimize $(y - X\tilde\beta)'(y - X\tilde\beta)$. This is the same quantity minimized in Equation 3.7, and hence it has the same solution. In general one would differentiate the log-likelihood, set the result equal to zero, and solve; here the result is again the linear regression score equations. In fact the score equations are typically defined as the derivative of the log-likelihood function. We have shown that the maximum likelihood estimate (MLE) for $\beta$ is identical to the least squares estimate under our assumptions that the errors are independent and Normally distributed with zero mean and constant variance $\sigma^2$.
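This equivalence is easy to check numerically. The sketch below (data and noise level invented) evaluates the log-likelihood of Equation 3.11 at the least squares estimate and at a perturbed vector; since the least squares solution minimizes the RSS exactly, it must score at least as high for any fixed $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, -1.0])        # illustrative "true" parameters
y = X @ beta + rng.normal(0.0, 0.2, size=n)

def log_likelihood(beta_tilde, sigma2, X, y):
    """Log-likelihood of Equation 3.11 under the Normal linear model."""
    resid = y - X @ beta_tilde
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)

# Least squares estimate (Equation 3.8), which minimizes the RSS.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Any perturbation of beta_hat increases the RSS, hence lowers the
# likelihood when sigma^2 is held fixed.
worse = beta_hat + np.array([0.1, -0.1])
print(log_likelihood(beta_hat, 0.04, X, y) > log_likelihood(worse, 0.04, X, y))
```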

If the variance $\sigma_i^2$ for each outcome $y_i$ is different, but known and independent of the other experiments, a simple variation known as weighted least squares can be used. In this procedure a weight matrix $W = \mathrm{diag}(1/\sigma_1^2, \dots, 1/\sigma_n^2)$ is used to standardize the unequal variances. The score equations become $X'W(y - X\hat\beta) = 0$, and the weighted least squares estimator is

$$\hat\beta = (X'WX)^{-1} X'Wy \qquad (3.13)$$

It is also possible to accommodate correlated errors when the covariance matrix of the errors is known [30].
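A minimal sketch of Equation 3.13, assuming invented per-experiment variances: the weighted estimator is computed directly, and then verified against the equivalent view of standardizing each experiment by $1/\sigma_i$ and running ordinary least squares on the transformed data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])          # illustrative "true" parameters
sigmas = rng.uniform(0.1, 1.0, size=n)   # known, unequal std deviations
y = X @ beta + rng.normal(0.0, sigmas)

# Weighted least squares of Equation 3.13 with W = diag(1/sigma_i^2).
W = np.diag(1.0 / sigmas**2)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent: standardize rows by 1/sigma_i, then run ordinary least squares.
Xs = X / sigmas[:, None]
ys = y / sigmas
beta_ols_std = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

print(np.allclose(beta_wls, beta_ols_std))
```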
