

4.4 Logistic Regression for Classification

Among the strengths of LR is the interpretability of its parameter estimates. Because the expectation function $\mu(\mathbf{x}, \boldsymbol{\beta})$ is the mean of a Bernoulli random variable, $\mu$ is a probability. A deeper analysis of a fitted LR model is possible if the model's attributes were chosen carefully to avoid certain types of interactions and correlations. Hosmer and Lemeshow [13] and Myers et al. [30] explain in detail how to create LR models and how to measure the significance of a fitted LR model. Hosmer and Lemeshow [13] cover in depth the interpretation of a fitted model's coefficients in a wide variety of settings.

We are interested in LR for data mining and high-dimensional classification. In these settings it is unlikely that the dataset's attributes were carefully chosen, or that one could sift through all combinations of hundreds of thousands of attributes to avoid harmful interactions and correlations. For this reason we must assume that our datasets will have correlated or even identical variables. Our experiment matrix $\mathbf{X}$ will have linear dependencies, and $\mathbf{X}^T \mathbf{X}$ will only be positive semidefinite. The columns of $\mathbf{X}$ will be badly scaled. The outcomes $\mathbf{y}$ used for estimating $\boldsymbol{\beta}$ may have as many false positives as true positives. Furthermore, all of these shortcomings must be handled without human intervention.

For the reasons stated above, we will not explore model adequacy, significance, or interpretability. Among the many interesting statistics available for LR model analysis, we are only interested in the loss function. This function will be part of the termination criterion for the parameter estimation methods described in Sections 4.2 and 4.3. Both LR and linear regression are members of a family of models called generalized linear models, and the loss function for members of this family is a scaled likelihood ratio known as the deviance. In particular, the deviance is defined as

DEV$(\hat{\boldsymbol{\beta}}) = 2 \ln\left( \frac{\mathbb{L}_{\max}}{\mathbb{L}_{\hat{\boldsymbol{\beta}}}} \right) \phi$   (4.17)

where $\mathbb{L}_{\max}$ is the maximum likelihood achievable, $\mathbb{L}_{\hat{\boldsymbol{\beta}}}$ is the likelihood of the actual model and parameter estimate being used, and $\phi$ is the dispersion parameter [25]. For the linear model of Equation 3.5 the dispersion parameter is $\sigma^2$, and for the logistic model of Equation 4.2 it is 1. Using these definitions, the deviance of the linear model is exactly the RSS, since the saturated model has zero residuals and hence $2\sigma^2 (\ln \mathbb{L}_{\max} - \ln \mathbb{L}_{\hat{\boldsymbol{\beta}}}) = 2\sigma^2 \, \mathrm{RSS}/(2\sigma^2) = \mathrm{RSS}$. For logistic regression with binary outputs, the maximum achievable likelihood is $\mathbb{L}_{\max} = 1$ and $\phi = 1$, so the deviance expands to
DEV$(\hat{\boldsymbol{\beta}}) = -2 \ln \mathbb{L}(\mathbf{X}, \mathbf{y}, \hat{\boldsymbol{\beta}})$   (4.18)

$\phantom{\mathrm{DEV}(\hat{\boldsymbol{\beta}})} = -2 \sum_{i=1}^R \left( y_i \ln \mu(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) + (1 - y_i) \ln\left(1 - \mu(\mathbf{x}_i, \hat{\boldsymbol{\beta}})\right) \right)$   (4.19)

Equations 4.18 and 4.19 follow Hosmer and Lemeshow [13]. Just as minimizing the RSS for linear regression was equivalent to maximizing the linear regression likelihood function, minimizing the LR deviance is equivalent to maximizing the LR likelihood function. For the remainder of this thesis, deviance should be understood to mean the LR deviance.
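As a concrete illustration, a minimal sketch of the binary deviance in Equation 4.19 might look like the following; the function and variable names are our own, and the clipping constant is only there to keep the logarithms finite when a fitted probability reaches 0 or 1.

\begin{verbatim}
import numpy as np

def lr_deviance(y, mu, eps=1e-12):
    """Binary LR deviance of Equation 4.19: -2 times the log-likelihood.

    y  : length-R array of 0/1 outcomes
    mu : length-R array of fitted probabilities mu(x_i, beta)
    eps: clipping constant keeping log() finite when mu is exactly 0 or 1
    """
    mu = np.clip(mu, eps, 1.0 - eps)
    return -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
\end{verbatim}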

Since the deviance is simply twice the negative log-likelihood, using either quantity to terminate iterative parameter estimation techniques will result in the same $\hat{\boldsymbol{\beta}}$. Because deficiencies in the data matrix could cause the LR score equations shown in Equation 4.12 to have multiple solutions $\hat{\boldsymbol{\beta}}$, waiting for $\hat{\boldsymbol{\beta}}$ to converge before terminating an iterative method may not be efficient. We prefer to terminate when the relative difference of the deviance, $(\mathrm{DEV}(\hat{\boldsymbol{\beta}}_i) - \mathrm{DEV}(\hat{\boldsymbol{\beta}}_{i+1})) / \mathrm{DEV}(\hat{\boldsymbol{\beta}}_{i+1})$, is sufficiently small. What constitutes sufficiently small will be discussed later in this thesis.
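In code the termination test is a single comparison; the sketch below is ours, and assumes the caller supplies the tolerance epsilon (what value is sufficiently small is deferred, as in the text).

\begin{verbatim}
def deviance_converged(dev_prev, dev_new, epsilon):
    """True when the relative decrease in deviance falls below epsilon."""
    return (dev_prev - dev_new) / dev_new < epsilon
\end{verbatim}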

LR is susceptible to the same parameter estimation problems caused by correlated attributes as is linear regression [13,27]. In Section 3.1 we discussed a technique called ridge regression for preventing overly large parameter estimates. A similar technique can be used when estimating $\boldsymbol{\beta}$ for LR. The simplest way to apply a ridge regression penalty to LR maximum likelihood estimation is to add it to the loss function, which is the deviance; this corresponds to subtracting it from the log-likelihood function and allows generic convergence criteria to be applied by numerical techniques. When using IRLS for parameter estimation, there is a weighted least squares linear regression subproblem to which traditional ridge regression penalties may be applied. Unfortunately, this is not equivalent to performing the IRLS derivation of Section 4.3 with the penalty applied to the LR log-likelihood function. Rather than deriving a custom IRLS that incorporates the penalty into the likelihood function, we take the simpler approach of applying traditional ridge regression when solving the weighted least squares subproblem.
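For the IRLS route, the ridge penalty enters only through the weighted least squares subproblem. A minimal sketch, assuming the IRLS weights $w_i$ and adjusted responses $z_i$ of Section 4.3 and a ridge parameter named lam of our own choosing, might be:

\begin{verbatim}
import numpy as np

def ridge_wls_step(X, z, w, lam):
    """Solve the ridge-penalized weighted least squares subproblem
    minimize (z - X beta)^T W (z - X beta) + lam * beta^T beta,
    with W = diag(w), via its normal equations."""
    Xw = X * w[:, np.newaxis]                  # W X
    A = X.T @ Xw + lam * np.eye(X.shape[1])    # X^T W X + lam I
    b = Xw.T @ z                               # X^T W z
    return np.linalg.solve(A, b)               # new beta estimate
\end{verbatim}

Setting lam to zero recovers the unpenalized weighted least squares update.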

Now that we have defined a termination criterion and described methods of applying ridge regression, we may summarize our two methods of finding the LR MLE. Algorithm 3 describes the direct maximum likelihood estimation approach of Section 4.2, while Algorithm 4 describes the IRLS approach of Section 4.3. Lines 3.1 and 4.1 of these algorithms include the regularization methods described above. While the direct method appears much simpler, it can be difficult to find a good iterative method for the nonlinear objective function shown in line 3.1. Despite the additional lines of computation for IRLS, the weighted least squares subproblem in line 4.1 is simply a linear system of equations.

We will be discussing various implementations of LR throughout this thesis, and we have created a tree figure depicting the relationships among our methods and modifications. Figure 4.3 shows the beginning of this tree. At this point the tree contains only LR and the nonlinear deviance minimization problem, followed by a choice between generic nonlinear methods on the right and IRLS on the left. Additional nodes will be added to this tree in Chapter 5.

Figure 4.3: Tree showing the relationships among choices made when implementing LR. This tree will be updated in later chapters as we describe LR implementation details and modifications.
\includegraphics{figures/treemap-lr.eps}

[Algorithm 3: direct maximum likelihood estimation of the LR parameters (Section 4.2). Each iteration updates $\hat{\boldsymbol{\beta}}_{i+1}$ by minimizing the regularized deviance objective (line 3.1), evaluates DEV$(\hat{\boldsymbol{\beta}}_{i+1})$, and sets $i := i+1$; the loop terminates when the relative difference of the deviance is sufficiently small.]
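As an illustration of the direct approach in Algorithm 3, one could hand the (optionally ridge-penalized) deviance to a generic nonlinear minimizer. The sketch below uses scipy.optimize.minimize merely as an example of such a method, not as the particular solver studied in this thesis.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def fit_lr_direct(X, y, lam=0.0):
    """Direct MLE: minimize the (ridge-penalized) deviance over beta."""
    def penalized_deviance(beta):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))    # logistic mean function
        mu = np.clip(mu, 1e-12, 1.0 - 1e-12)
        dev = -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
        return dev + lam * beta @ beta          # ridge penalty added to the loss
    result = minimize(penalized_deviance, x0=np.zeros(X.shape[1]))
    return result.x
\end{verbatim}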

[Algorithm 4: maximum likelihood estimation of the LR parameters via IRLS (Section 4.3). Each iteration solves the regularized weighted least squares subproblem for $\hat{\boldsymbol{\beta}}_{i+1}$ (line 4.1), evaluates DEV$(\hat{\boldsymbol{\beta}}_{i+1})$, and sets $i := i+1$; the loop terminates when the relative difference of the deviance is sufficiently small.]
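For comparison, a compact sketch in the spirit of Algorithm 4, reusing the lr_deviance and ridge_wls_step helpers from the sketches above, could look like this; the weight floor and iteration cap are our own safeguards rather than part of the algorithm as stated.

\begin{verbatim}
import numpy as np

def fit_lr_irls(X, y, lam=0.0, epsilon=1e-8, max_iter=100):
    """IRLS with a ridge-penalized weighted least squares subproblem,
    terminating on the relative difference of the deviance."""
    beta = np.zeros(X.shape[1])
    dev_prev = np.inf
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))    # current fitted probabilities
        w = np.maximum(mu * (1.0 - mu), 1e-12)  # IRLS weights, floored at a tiny value
        z = X @ beta + (y - mu) / w             # adjusted (working) response
        beta = ridge_wls_step(X, z, w, lam)     # regularized WLS subproblem (line 4.1)
        dev = lr_deviance(y, 1.0 / (1.0 + np.exp(-X @ beta)))
        if (dev_prev - dev) / dev < epsilon:    # relative deviance criterion
            break
        dev_prev = dev
    return beta
\end{verbatim}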

