

3.1 Normal Model

Regression is a collection of statistical function-fitting techniques. These techniques are classified according to the form of the function being fit to the data. In linear regression a linear function is used to describe the relation between the independent variable or vector $\mathbf{x}$ and a dependent variable $y$. This function has the form

$\displaystyle f(\mathbf{x}, \mathbf{\beta}) = \beta_0 + \beta_1 x_1 + \ldots + \beta_M x_M$ (3.1)

where $\mathbf{\beta}$ is the vector of unknown parameters to be estimated from the data. If we assume that $\mathbf{x}$ has a zeroth element $x_0 = 1$, we can include the constant term $\beta_0$ in the parameter vector $\mathbf{\beta}$ and write this function more conveniently as

$\displaystyle f(\mathbf{x}, \mathbf{\beta}) = \mathbf{\beta}^T \mathbf{x}$ (3.2)
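To make this convention concrete, the short sketch below (in Python with numpy, using hypothetical values rather than any data from this thesis) prepends the constant element $x_0 = 1$ to a feature vector and evaluates $f(\mathbf{x}, \mathbf{\beta}) = \mathbf{\beta}^T \mathbf{x}$.

import numpy as np

# Hypothetical example with M = 3 explanatory variables.
x = np.array([2.0, -1.0, 0.5])           # (x_1, ..., x_M)
beta = np.array([1.0, 0.3, -0.7, 2.0])   # (beta_0, beta_1, ..., beta_M)

# Prepend x_0 = 1 so the offset beta_0 is absorbed into the inner product.
x_aug = np.concatenate(([1.0], x))

f = beta @ x_aug                          # f(x, beta) = beta^T x
print(f)                                  # 1.0 + 0.3*2.0 + (-0.7)*(-1.0) + 2.0*0.5 = 3.3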

Because $f$ is linear, the parameters $\beta_1, \ldots, \beta_M$ are sometimes called the slope parameters, while $\beta_0$ is the offset at the origin. Due to random processes such as measurement errors, we assume that $y$ does not correspond perfectly to $f(\mathbf{x}, \mathbf{\beta})$. For this reason a statistical model is created which accounts for randomness. For linear regression we will use the linear model

$\displaystyle y = f(\mathbf{x},\mathbf{\beta}) + \epsilon$ (3.3)

where the error term $\epsilon$ is a Normally-distributed random variable with zero mean and unknown variance $\sigma^2$. Therefore the expected value of this linear model is
$\displaystyle E(y)$ $\displaystyle =$ $\displaystyle f(\mathbf{x},\mathbf{\beta}) + E(\epsilon)$ (3.4)
  $\displaystyle =$ $\displaystyle f(\mathbf{x},\mathbf{\beta})$ (3.5)

and for this reason $f$ is called the expectation function. We assume that $\sigma^2$ is constant and hence does not depend on $\mathbf{x}$. We also assume that if multiple experiments $\mathbf{x_i}$ are conducted, then the errors $\epsilon_i$ are independent.
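These assumptions are easy to mimic in a simulation. The following sketch (hypothetical synthetic data, not data analyzed in this thesis) draws $R$ experiments and generates outcomes from the model of Equation 3.3 with independent Normal errors of constant variance.

import numpy as np

rng = np.random.default_rng(0)

R, M = 100, 3                            # hypothetical numbers of experiments and variables
beta = np.array([1.0, 0.3, -0.7, 2.0])   # "true" parameters (beta_0, ..., beta_M)
sigma = 0.5                              # constant error standard deviation

X = np.column_stack([np.ones(R), rng.normal(size=(R, M))])  # x_0 = 1 in every row
eps = rng.normal(0.0, sigma, size=R)                        # independent N(0, sigma^2) errors
y = X @ beta + eps                       # linear model y = f(x, beta) + epsilon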

Suppose $\mathbf{X}$ is an $R \times M$ matrix whose rows $\mathbf{x_i}$ represent $R$ experiments, each described by $M$ variables. Let $\mathbf{y}$ be an $R \times 1$ vector representing the outcome of each experiment in $\mathbf{X}$. We wish to estimate values for the parameters $\mathbf{\beta}$ such that the linear model of Equation 3.3 is, hopefully, a useful summarization of the data $\mathbf{X},\mathbf{y}$. One common method of parameter estimation for linear regression is least squares. This method finds a vector $\hat{\mathbf{\beta}}$ which minimizes the residual sum of squares (RSS), defined as

RSS$\displaystyle (\hat{\mathbf{\beta}})$ $\displaystyle =$ $\displaystyle \sum_{i=1}^R \left( y_i - \hat{\mathbf{\beta}}^T \mathbf{x_i} \right)^2$ (3.6)
  $\displaystyle =$ $\displaystyle (\mathbf{y} - \mathbf{X} \hat{\mathbf{\beta}})^T (\mathbf{y} - \mathbf{X} \hat{\mathbf{\beta}})$ (3.7)

where $\mathbf{x_i}$ is the $i^{\mathrm{th}}$ row of $\mathbf{X}$. Note that $\mathbf{\beta}$ is the ``true'' parameter vector, while $\hat{\mathbf{\beta}}$ is an informed guess for $\mathbf{\beta}$. Throughout this thesis a variable with a circumflex, such as $\hat{\mathbf{\beta}}$, is an estimate of some quantity we cannot know, such as $\mathbf{\beta}$. The parameter vector is sometimes called a weight vector, since the elements of $\mathbf{\beta}$ are multiplicative weights for the columns of $\mathbf{X}$. For linear regression the loss function is the RSS; regression techniques with different error structure may have other loss functions.
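As a quick check of this definition, the sketch below (hypothetical arrays again) verifies that the sum form of Equation 3.6 and the matrix form of Equation 3.7 agree.

import numpy as np

rng = np.random.default_rng(0)
R, M = 100, 4                                  # hypothetical sizes, constant column included
X = np.column_stack([np.ones(R), rng.normal(size=(R, M - 1))])
y = rng.normal(size=R)
beta_hat = rng.normal(size=M)                  # an arbitrary candidate estimate

rss_sum = np.sum((y - X @ beta_hat) ** 2)      # Equation 3.6, summed over experiments
resid = y - X @ beta_hat
rss_matrix = resid @ resid                     # Equation 3.7, (y - X beta_hat)^T (y - X beta_hat)
assert np.isclose(rss_sum, rss_matrix)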

To minimize the RSS we compute the partial derivative of RSS$(\hat{\mathbf{\beta}})$ with respect to each element of $\hat{\mathbf{\beta}}$ and set the partials equal to zero. The result is the score equations $\mathbf{X}^T (\mathbf{X} \hat{\mathbf{\beta}} - \mathbf{y}) = \mathbf{0}$ for linear regression, from which we compute the least squares estimator

$\displaystyle \hat{\mathbf{\beta}} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$ (3.8)

The expected value of $\hat{\mathbf{\beta}}$ is $\mathbf{\beta}$, and hence $\hat{\mathbf{\beta}}$ is unbiased. The covariance matrix of $\hat{\mathbf{\beta}}$, cov$(\hat{\mathbf{\beta}})$, is $\sigma^{2} (\mathbf{X}^T \mathbf{X})^{-1}$. The variances along the diagonal are the smallest possible variances for any unbiased estimate of $\mathbf{\beta}$. These properties follow from our assumption that the errors $\epsilon_i$ are independent and Normally distributed with zero mean and constant variance $\sigma^2$ [30].
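The estimator of Equation 3.8 is straightforward to compute. The sketch below (hypothetical synthetic data) solves the normal equations directly and also uses numpy's least squares routine, which is numerically preferable to forming $(\mathbf{X}^T \mathbf{X})^{-1}$ explicitly.

import numpy as np

rng = np.random.default_rng(0)
R, M = 200, 3
beta_true = np.array([1.0, 0.3, -0.7, 2.0])
sigma = 0.5
X = np.column_stack([np.ones(R), rng.normal(size=(R, M))])
y = X @ beta_true + rng.normal(0.0, sigma, size=R)

# Least squares estimator: beta_hat = (X^T X)^{-1} X^T y (Equation 3.8).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable computation.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Covariance of beta_hat under the model assumptions: sigma^2 (X^T X)^{-1}.
cov_beta_hat = sigma**2 * np.linalg.inv(X.T @ X)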

Another method of estimating $\mathbf{\beta}$ is maximum likelihood estimation. In this method we evaluate the probability of encountering the outcomes $\mathbf{y}$ for our data $\mathbf{X}$ under the linear model of Equation 3.3 when $\mathbf{\beta} = \hat{\mathbf{\beta}}$. We will choose as our estimate of $\mathbf{\beta}$ the value $\hat{\mathbf{\beta}}$ which maximizes the likelihood function

$\displaystyle \mathbb{L}( \mathbf{y}, \mathbf{X}, \hat{\mathbf{\beta}}, \sigma^2)$ $\displaystyle =$ $\displaystyle P( y_1 \vert \mathbf{x_1}, \hat{\mathbf{\beta}}, \sigma^2) \cdots P( y_R \vert \mathbf{x_R}, \hat{\mathbf{\beta}}, \sigma^2)$ (3.9)
  $\displaystyle =$ $\displaystyle (2 \pi \sigma^2)^{-R/2} \prod_{i=1}^R \exp{\left( -\frac{1}{2 \sigma^2} \left( y_i - \hat{\mathbf{\beta}}^T \mathbf{x_i} \right)^2 \right)}$ (3.10)

over $\hat{\mathbf{\beta}}$. We are interested in the maximization of $\mathbb{L}$ and not its actual value, which allows us to work with the more convenient log-transformation of the likelihood. Since we are maximizing over $\hat{\mathbf{\beta}}$ we can drop factors and terms which are constant with respect to $\hat{\mathbf{\beta}}$. After discarding these constants, the log-likelihood function is
$\displaystyle \ln \mathbb{L}( \mathbf{y}, \mathbf{X}, \hat{\mathbf{\beta}}, \sigma^2)$ $\displaystyle =$ $\displaystyle \sum_{i=1}^R -\frac{1}{2\sigma^2} \left( y_i - \hat{\mathbf{\beta}}^T \mathbf{x_i} \right)^2$ (3.11)
  $\displaystyle =$ $\displaystyle -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \hat{\mathbf{\beta}})^T (\mathbf{y} - \mathbf{X} \hat{\mathbf{\beta}})$ (3.12)

To maximize the log-likelihood function we need to minimize $(\mathbf{y} - \mathbf{X} \hat{\mathbf{\beta}})^T (\mathbf{y} - \mathbf{X} \hat{\mathbf{\beta}})$. This is the same quantity minimized in Equation 3.7, and hence it has the same solution. In general one would differentiate the log-likelihood and set the result equal to zero; for the linear model this again yields the linear regression score equations. In fact the score equations are typically defined as the derivative of the log-likelihood function set equal to zero. We have shown that the maximum likelihood estimate (MLE) of $\mathbf{\beta}$ is identical to the least squares estimate under our assumptions that the errors $\epsilon_i$ are independent and Normally distributed with zero mean and constant variance $\sigma^2$.
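This equivalence is easy to confirm numerically. In the hypothetical sketch below, the score $\mathbf{X}^T (\mathbf{y} - \mathbf{X} \hat{\mathbf{\beta}})$, which is proportional to the gradient of the log-likelihood in Equation 3.12, vanishes (up to floating-point error) at the least squares estimate.

import numpy as np

rng = np.random.default_rng(0)
R, M = 200, 3
beta_true = np.array([1.0, 0.3, -0.7, 2.0])
X = np.column_stack([np.ones(R), rng.normal(size=(R, M))])
y = X @ beta_true + rng.normal(0.0, 0.5, size=R)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate

score = X.T @ (y - X @ beta_hat)               # gradient of Equation 3.12, up to the factor 1/sigma^2
print(np.max(np.abs(score)))                   # approximately zero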

If the variance $\sigma_i^2$ for each outcome $y_i$ is different, but known and independent of the other experiments, a simple variation known as weighted least squares can be used. In this procedure a weight matrix $\mathbf{W} =$   diag$((\sigma_1^{-2}, \ldots, \sigma_R^{-2})^T)$, containing the inverse variances, is used to standardize the unequal variances. The score equations become $\mathbf{X}^T \mathbf{W} (\mathbf{X} \hat{\mathbf{\beta}} - \mathbf{y}) = \mathbf{0}$, and the weighted least squares estimator is

$\displaystyle \hat{\mathbf{\beta}} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y}$ (3.13)

It is also possible to accommodate correlated errors when the covariance matrix is known [30].
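A sketch of the weighted estimator (again on hypothetical synthetic data, with the known per-experiment variances supplied directly) follows Equation 3.13.

import numpy as np

rng = np.random.default_rng(0)
R, M = 200, 3
beta_true = np.array([1.0, 0.3, -0.7, 2.0])
X = np.column_stack([np.ones(R), rng.normal(size=(R, M))])

sigma2 = rng.uniform(0.1, 2.0, size=R)         # known, unequal error variances
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2))

W = np.diag(1.0 / sigma2)                      # weight matrix of inverse variances
beta_hat_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # Equation 3.13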

