

4.3 Iteratively Re-weighted Least Squares

An alternative to maximizing the LR likelihood directly with a general-purpose numerical optimizer is the iteratively re-weighted least squares (IRLS) technique. This technique uses the Newton-Raphson algorithm to solve the LR score equations. The process is explained in more detail below.

To derive the LR score equations, we first differentiate the log-likelihood function with respect to the parameters

$\displaystyle \frac{\partial}{\partial \beta_j}\ln \mathbb{L}(\mathbf{X}, \mathbf{y}, \mathbf{\beta})$ $\displaystyle =$ $\displaystyle \sum_{i=1}^R \left( y_i \frac{\frac{\partial}{\partial \beta_j} \mu(\mathbf{x_i}, \mathbf{\beta})}{\mu(\mathbf{x_i}, \mathbf{\beta})} - (1 - y_i) \frac{\frac{\partial}{\partial \beta_j} \mu(\mathbf{x_i}, \mathbf{\beta})}{1 - \mu(\mathbf{x_i}, \mathbf{\beta})} \right)$ (4.9)
  $\displaystyle =$ $\displaystyle \sum_{i=1}^R \left( y_i \frac{x_{ij}\, \mu(\mathbf{x_i}, \mathbf{\beta}) (1 - \mu(\mathbf{x_i}, \mathbf{\beta}))}{\mu(\mathbf{x_i}, \mathbf{\beta})} - (1 - y_i) \frac{x_{ij}\, \mu(\mathbf{x_i}, \mathbf{\beta}) (1 - \mu(\mathbf{x_i}, \mathbf{\beta}))}{1 - \mu(\mathbf{x_i}, \mathbf{\beta})} \right)$ (4.10)
  $\displaystyle =$ $\displaystyle \sum_{i=1}^R x_{ij} (y_i - \mu(\mathbf{x_i}, \mathbf{\beta}))$ (4.11)

where $ j=1,\ldots,M$, $ M$ is the number of parameters, and we have used the identity $ \frac{\partial}{\partial \beta_j} \mu(\mathbf{x_i}, \mathbf{\beta}) = x_{ij}\, \mu(\mathbf{x_i}, \mathbf{\beta}) (1 - \mu(\mathbf{x_i}, \mathbf{\beta}))$ to move from Equation 4.9 to Equation 4.10. We then set each of these partial derivatives equal to zero. If we extend the definition of $ \mu$ such that $ \mu(\mathbf{X}, \mathbf{\beta}) = (\mu(\mathbf{x_1}, \mathbf{\beta}), \ldots, \mu(\mathbf{x_R}, \mathbf{\beta}))^T$, we may write the score equations as

$\displaystyle \mathbf{X}^T (\mathbf{y} - \mu(\mathbf{X}, \mathbf{\beta})) = 0$ (4.12)
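
For concreteness, the score vector in Equation 4.12 can be evaluated directly. The following is a minimal NumPy sketch, assuming a dense $ R \times M$ design matrix X, a 0/1 response vector y, and the logistic mean function; the names are illustrative and not part of the thesis code.

    import numpy as np

    def score(X, y, beta):
        # The score vector X^T (y - mu(X, beta)) of Equation 4.12.
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # mu(x_i, beta) for every row x_i
        return X.T @ (y - mu)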

Define $ \mathbf{h}(\mathbf{\beta})$ as the vector of partial derivatives shown in Equation 4.11. Finding a solution to the LR score equations is equivalent to finding the zeros of $ \mathbf{h}$. Because the LR log-likelihood is concave [27], it has exactly one maximum and hence $ \mathbf{h}$ has exactly one zero. The Newton-Raphson algorithm finds this optimum through repeated linear approximations. After an initial guess $ \hat{\mathbf{\beta}}_0$ is chosen, each Newton-Raphson iteration updates the current estimate $ \hat{\mathbf{\beta}}_i$ via the first-order approximation

$\displaystyle \hat{\mathbf{\beta}}_{i+1} = \hat{\mathbf{\beta}}_i - \mathbf{J}_{\mathbf{h}}(\hat{\mathbf{\beta}}_i)^{-1}\, \mathbf{h}(\hat{\mathbf{\beta}}_i)$ (4.13)

where $ \mathbf{J}_{\mathbf{h}}(\hat{\mathbf{\beta}})$ is the Jacobian of $ \mathbf{h}$ evaluated at $ \hat{\mathbf{\beta}}$. The Jacobian is a matrix, and each row is simply the transposed gradient of one component of $ \mathbf{h}$ evaluated at $ \hat{\mathbf{\beta}}$. Hence the $ jk$-element of $ \mathbf{J}_{\mathbf{h}}(\hat{\mathbf{\beta}})$ is $ -\sum_{i=1}^R x_{ij}\, x_{ik}\, \mu(\mathbf{x_i}, \mathbf{\beta}) (1-\mu(\mathbf{x_i}, \mathbf{\beta}))$. If we define $ w_i = \mu(\mathbf{x_i}, \mathbf{\beta}) (1-\mu(\mathbf{x_i}, \mathbf{\beta}))$ and $ \mathbf{W} =$ diag$ (w_1, \ldots, w_R)$, then the Jacobian may be written in matrix notation as $ -\mathbf{X}^T \mathbf{W} \mathbf{X}$. Using this fact and Equation 4.12 we may rewrite the Newton-Raphson update formula as

$\displaystyle \hat{\mathbf{\beta}}_{i+1} = \hat{\mathbf{\beta}}_i + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mu(\mathbf{X}, \hat{\mathbf{\beta}}_i))$ (4.14)
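
A single update of Equation 4.14 is easy to sketch in the same hedged NumPy style as above. Here $ \mathbf{W}$ is applied through elementwise scaling rather than being formed as an $ R \times R$ diagonal matrix, and np.linalg.solve stands in for the explicit inverse; the function name is illustrative.

    def newton_step(X, y, beta):
        # One Newton-Raphson update, Equation 4.14:
        #   beta <- beta + (X^T W X)^{-1} X^T (y - mu)
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)                    # w_i = mu_i (1 - mu_i)
        XtWX = X.T @ (w[:, None] * X)          # X^T W X without building diag(W)
        return beta + np.linalg.solve(XtWX, X.T @ (y - mu))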

Since $ \hat{\mathbf{\beta}}_i = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{X}\, \hat{\mathbf{\beta}}_i$ we may rewrite Equation 4.14 as
$\displaystyle \hat{\mathbf{\beta}}_{i+1}$ $\displaystyle =$ $\displaystyle (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \left( \mathbf{X}^T \mathbf{W} \mathbf{X}\, \hat{\mathbf{\beta}}_i + \mathbf{X}^T (\mathbf{y} - \mu(\mathbf{X}, \hat{\mathbf{\beta}}_i)) \right)$ (4.15)
  $\displaystyle =$ $\displaystyle (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$ (4.16)

where $ \mathbf{z} = \mathbf{X} \hat{\mathbf{\beta}}_i + \mathbf{W}^{-1} (\mathbf{y} - \mu(\mathbf{X}, \hat{\mathbf{\beta}}_i))$ [25,30,10]. The elements of the vector $ \mathbf{z}$ are often called the adjusted dependent covariates, since we may view Equation 4.16 as the weighted least squares problem from Section 3.1 with dependent variables, or covariates, $ \mathbf{z}$. Newton-Raphson iterations may be stopped according to any of several criteria, such as convergence of $ \hat{\mathbf{\beta}}$ or of the LR log-likelihood from Equation 4.8.

Solving Equation 4.16 requires a matrix inversion or Cholesky back-substitution, as well as several matrix-vector multiplications. As a result, IRLS as described has time complexity O$ \left( M^3 + MR \right)$. We have commented several times in this thesis that O$ \left( M^3 \right)$ is unacceptably slow for our purposes. A simple alternative to the computations of Equation 4.16 will be presented in Chapter 5.

