Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ be a dataset with binary outcomes. For each
experiment $\mathbf{x}_i$ in $\mathcal{D}$ the outcome is either $y_i = 1$ or
$y_i = 0$. Experiments with outcome $y_i = 1$ are said to belong to
the *positive class*, while experiments with $y_i = 0$ belong to
the *negative class*. We wish to create a regression model which
allows *classification* of an experiment $\mathbf{x}_i$ as
*positive* or *negative*, that is, belonging to either the
positive or negative class. Though LR is applicable to datasets with
outcomes in $\{0, 1, \ldots, K\}$, we will restrict our discussion to the binary
case.

We can think of an experiment in $\mathcal{D}$ as a Bernoulli trial with mean
parameter $\mu_i$. Thus $y_i$ is a Bernoulli random variable
with mean $\mu_i$ and variance $\mu_i(1 - \mu_i)$. It is
important to note that the variance of $y_i$ depends on the mean and
hence on the experiment $\mathbf{x}_i$. To model the relation between each
experiment $\mathbf{x}_i$ and the expected value of its outcome, we will use
the logistic function. This function is written as

$$\mu(\mathbf{x}, \boldsymbol{\beta}) = \frac{1}{1 + e^{-\boldsymbol{\beta}^{T}\mathbf{x}}} \tag{4.1}$$

where $\boldsymbol{\beta}$ is the vector of parameters, and its shape may be seen
in Figure 4.1. We assume that $x_{i0} = 1$ so
that $\beta_0$ is a constant term, just as we did for linear
regression in Section 3.1. Thus our regression model is

$$y_i = \mu(\mathbf{x}_i, \boldsymbol{\beta}) + \varepsilon_i \tag{4.2}$$

where $\varepsilon_i$ is our error term. It may be easily seen in
Figure 4.2 that the error term can
have only one of two values. If $y_i = 1$ then
$\varepsilon_i = 1 - \mu(\mathbf{x}_i, \boldsymbol{\beta})$, otherwise
$\varepsilon_i = -\mu(\mathbf{x}_i, \boldsymbol{\beta})$. Since $y_i$ is
Bernoulli with mean $\mu_i$ and variance $\mu_i(1 - \mu_i)$, the error has zero mean and variance
$\mu_i(1 - \mu_i)$. This is different from the linear
regression case, where the error and outcome had constant variance
independent of the mean.
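
The Bernoulli error structure above is easy to verify numerically. The sketch below (all numerical values are illustrative choices, not from the text) evaluates the logistic function of Equation 4.1 for one experiment, simulates many Bernoulli outcomes with that mean, and checks that the error $\varepsilon = y - \mu$ has mean zero and variance $\mu(1-\mu)$:

```python
import numpy as np

def mu(x, beta):
    """Logistic function of Eq. (4.1): the modelled mean of the outcome."""
    return 1.0 / (1.0 + np.exp(-beta @ x))

# Simulate one experiment many times and inspect the moments of the
# error term eps = y - mu (beta and x here are illustrative values).
rng = np.random.default_rng(0)
beta = np.array([0.25, -1.0])   # beta_0 is the constant term
x = np.array([1.0, 0.8])        # x_0 = 1 by convention
m = mu(x, beta)
y = rng.binomial(1, m, size=200_000)
eps = y - m
print(eps.mean())               # close to 0
print(eps.var(), m * (1 - m))   # both close to mu(1 - mu)
```

Note that `eps` takes only the two values $1-\mu$ and $-\mu$, exactly as described above.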

Because the LR model is nonlinear in $\boldsymbol{\beta}$, minimizing the RSS
as defined for linear regression in Section 3.1 is not
appropriate. Not only is it difficult to compute the minimum of the
RSS, but the RSS minimizer will not correspond to maximizing the
likelihood function. It is possible to transform the logistic model
into one which is linear in its parameters using the *logit*
function, defined as

$$\operatorname{logit}(\mu) = \ln\left(\frac{\mu}{1 - \mu}\right). \tag{4.3}$$

We apply the logit to the outcome variable and expectation function of
the original model in Equation 4.2 to create the
new model

$$\operatorname{logit}(y_i) = \boldsymbol{\beta}^{T}\mathbf{x}_i + \varepsilon_i'. \tag{4.4}$$

However, we cannot use linear regression least squares techniques for
this new model because the error $\varepsilon_i'$ is not Normally
distributed and its variance is not independent of the mean. One
might further observe that this transformation is not well defined for
$y_i = 0$ or $y_i = 1$. Therefore we turn to parameter estimation methods
such as maximum likelihood or *iteratively re-weighted least squares*.
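
As a minimal sketch of iteratively re-weighted least squares, the code below applies Newton's method to the log-likelihood: each step solves a weighted least-squares system whose weights are the Bernoulli variances $\mu_i(1-\mu_i)$. The function name, the synthetic data, and `beta_true` are illustrative assumptions, not part of the text:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """IRLS / Newton sketch for the model of Eq. (4.2).

    X is assumed to already contain a leading column of ones (x_{i0} = 1).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # Eq. (4.1), row-wise
        w = mu * (1.0 - mu)                    # Bernoulli variances = weights
        grad = X.T @ (y - mu)                  # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])          # Fisher information X^T W X
        step = np.linalg.solve(hess, grad)     # weighted least-squares step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Illustrative synthetic data: sample outcomes from a known parameter vector.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 2.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
beta_hat = irls_logistic(X, y)
print(beta_hat)   # should be near beta_true, up to sampling error
```

Each iteration is an ordinary weighted least-squares solve, which is where the method's name comes from; the weights are recomputed from the current $\boldsymbol{\beta}$ at every step.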

It is important to notice that LR is a linear classifier, a consequence of
the linear relation between the parameters and the components of the data.
This means that LR can be thought of as finding a hyperplane to
separate positive and negative data points. In high-dimensional
spaces, the common wisdom is that linear separators are almost always
adequate to separate the classes.
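
The hyperplane view follows directly from Equation 4.1: since the logistic function is monotone with $\mu = 1/2$ at $\boldsymbol{\beta}^{T}\mathbf{x} = 0$, thresholding $\mu$ at $1/2$ is the same as checking the sign of $\boldsymbol{\beta}^{T}\mathbf{x}$, and the decision boundary $\boldsymbol{\beta}^{T}\mathbf{x} = 0$ is a hyperplane. A small sketch (the parameter vector and points are illustrative values, not from the text):

```python
import numpy as np

# Thresholding mu at 1/2 is equivalent to checking the sign of beta^T x,
# so the boundary between predicted classes is the hyperplane beta^T x = 0.
beta = np.array([-1.0, 2.0, 1.0])        # beta_0 = -1 is the constant term
X = np.array([[1.0, 1.00, 0.50],         # leading 1 is x_0
              [1.0, 0.10, 0.30],
              [1.0, 0.75, -0.25]])
mu = 1.0 / (1.0 + np.exp(-X @ beta))
by_mu = mu >= 0.5                        # classify via the logistic mean
by_sign = X @ beta >= 0.0                # classify via the hyperplane
print(by_mu, by_sign)                    # identical predictions
```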