
4.1 Logistic Model

Let $\mathbf{X}, \mathbf{y}$ be a dataset with binary outcomes. For each experiment $\mathbf{x_i}$ in $\mathbf{X}$ the outcome is either $y_i = 1$ or $y_i = 0$. Experiments with outcome $y_i = 1$ are said to belong to the positive class, while experiments with $y_i = 0$ belong to the negative class. We wish to create a regression model which allows an experiment $\mathbf{x_i}$ to be classified as positive or negative. Though LR is applicable to datasets with outcomes in $[0,1]$, we will restrict our discussion to the binary case.

Figure 4.1: Logistic function with one attribute $x$.

Figure 4.2: The logistic model's binomial error variance changes with the mean $\mu(x)$.

We can think of an experiment in $\mathbf{X}$ as a Bernoulli trial with mean parameter $\mu(\mathbf{x_i})$. Thus $y_i$ is a Bernoulli random variable with mean $\mu(\mathbf{x_i})$ and variance $\mu(\mathbf{x_i})(1-\mu(\mathbf{x_i}))$. It is important to note that the variance of $y_i$ depends on the mean, and hence on the experiment $\mathbf{x_i}$. To model the relation between each experiment $\mathbf{x_i}$ and the expected value of its outcome, we will use the logistic function. This function is written as

$\displaystyle \mu(\mathbf{x},\beta) = \frac{\exp(\beta^T \mathbf{x})}{1 + \exp(\beta^T \mathbf{x})}$ (4.1)

where $\beta$ is the vector of parameters; the function's shape may be seen in Figure 4.1. We assume that $x_0 = 1$ so that $\beta_0$ is a constant term, just as we did for linear regression in Section 3.1. Thus our regression model is

$\displaystyle y = \mu(\mathbf{x}, \beta) + \epsilon$ (4.2)

where $\epsilon$ is our error term. It may be seen in Figure 4.2 that the error term $\epsilon$ can take only one of two values: if $y = 1$ then $\epsilon = 1-\mu(\mathbf{x})$, otherwise $\epsilon = -\mu(\mathbf{x})$. Since $y$ is Bernoulli with mean $\mu(\mathbf{x})$ and variance $\mu(\mathbf{x})(1-\mu(\mathbf{x}))$, the error $\epsilon$ has zero mean and variance $\mu(\mathbf{x})(1-\mu(\mathbf{x}))$; indeed, $E[\epsilon] = \mu(\mathbf{x})\bigl(1-\mu(\mathbf{x})\bigr) + \bigl(1-\mu(\mathbf{x})\bigr)\bigl(-\mu(\mathbf{x})\bigr) = 0$. This differs from the linear regression case, where the error and outcome had constant variance $\sigma^2$ independent of the mean.
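
To make the mean-dependent variance concrete, the following is a minimal Python sketch (not part of the original text) that evaluates the logistic function of Equation 4.1 and the corresponding error variance $\mu(1-\mu)$ at a few points; the parameter values are hypothetical.

    import numpy as np

    def logistic_mean(x, beta):
        # Logistic mean function mu(x, beta) of Equation 4.1.
        # x is an attribute vector with x[0] == 1 so that beta[0]
        # acts as the constant term.
        t = np.dot(beta, x)
        return np.exp(t) / (1.0 + np.exp(t))

    # Hypothetical one-attribute example: beta = (beta_0, beta_1).
    beta = np.array([-1.0, 2.0])
    for x1 in (-2.0, 0.0, 0.5, 2.0):
        x = np.array([1.0, x1])       # prepend x_0 = 1
        mu = logistic_mean(x, beta)
        var = mu * (1.0 - mu)         # binomial error variance mu(1 - mu)
        print("x1=%+.1f  mu=%.3f  var=%.3f" % (x1, mu, var))

The variance printed here is largest when $\mu$ is near $1/2$ and vanishes as $\mu$ approaches $0$ or $1$, which is the behavior Figure 4.2 illustrates.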

Because the LR model is nonlinear in $\beta$, minimizing the RSS as defined for linear regression in Section 3.1 is not appropriate. Not only is it difficult to compute the minimum of the RSS, but the RSS minimizer will not correspond to the maximizer of the likelihood function. It is possible to transform the logistic model into one which is linear in its parameters using the logit function $g(\mu)$, defined as

$\displaystyle g(\mu) = \ln\left( \frac{\mu}{1-\mu} \right)$ (4.3)

We apply the logit to the outcome variable and expectation function of the original model in Equation 4.2 to create the new model

$\displaystyle g(y) = g(\mu(\mathbf{x})) + \tilde{\epsilon} = \beta^T \mathbf{x} + \tilde{\epsilon}(\mathbf{x})$ (4.4)

However, we cannot use linear regression least squares techniques for this new model because the error $\tilde{\epsilon}$ is not Normally distributed and its variance is not independent of the mean. One might further observe that the transformation is not well defined for $y=0$ or $y=1$, where the logit is infinite. Therefore we turn to parameter estimation methods such as maximum likelihood or iteratively re-weighted least squares.
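
As an illustration of the second option, here is a minimal Python sketch of iteratively re-weighted least squares for the logistic model, which coincides with Newton-Raphson applied to the log-likelihood; the fixed iteration count and lack of convergence or separability checks are simplifications.

    import numpy as np

    def irls_logistic(X, y, iterations=25):
        # Fit the logistic model by iteratively re-weighted least
        # squares.  X is an n-by-p design matrix whose first column
        # is all ones, and y is a length-n vector of 0/1 outcomes.
        beta = np.zeros(X.shape[1])
        for _ in range(iterations):
            mu = 1.0 / (1.0 + np.exp(-X @ beta))   # current fitted means
            W = mu * (1.0 - mu)                    # binomial weights mu(1 - mu)
            # Newton step: beta += (X^T W X)^{-1} X^T (y - mu)
            H = X.T @ (W[:, None] * X)
            beta += np.linalg.solve(H, X.T @ (y - mu))
        return beta

Each update solves a weighted least squares problem whose weights $\mu(1-\mu)$ are recomputed from the current fit, which is the source of the method's name.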

It is important to notice that LR is a linear classifier, a result of the linear relation between the parameters and the components of the data. This means that LR can be thought of as finding a hyperplane to separate positive and negative data points. In high-dimensional spaces, the common wisdom is that linear separators are almost always adequate to separate the classes.
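
Since $\mu(\mathbf{x}, \beta) \ge 1/2$ exactly when $\beta^T \mathbf{x} \ge 0$, classifying with a fitted model reduces to checking which side of the hyperplane $\beta^T \mathbf{x} = 0$ an experiment falls on. A minimal sketch, using the same vector conventions as the earlier snippets:

    import numpy as np

    def classify(x, beta):
        # Label x positive (1) when mu(x, beta) >= 1/2, which holds
        # exactly when beta^T x >= 0; the decision boundary is the
        # hyperplane beta^T x = 0.
        return 1 if np.dot(beta, x) >= 0.0 else 0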


Copyright 2004 Paul Komarek, komarek@cmu.edu