5.2.1.1 Stability Parameter Motivation

The earliest problems we encountered in our implementations were floating-point underflow and overflow. Because the datasets that concern us often have ten thousand or more attributes, the discriminant can easily grow large. When computing the exponential in the LR expectation function of Equation 4.1, a discriminant greater than 709 is sufficient to overflow IEEE 754 double-precision floating-point arithmetic, the standard for modern computer hardware. The logistic function becomes indistinguishable from 1 when the discriminant reaches 37. IEEE 754 denormalized floating-point values allow discriminants as low as -745 to be distinguished from zero; however, denormalized values have a reduced mantissa, which compromises accuracy. When the logit reaches 0 or 1, the IRLS weights become 0 and the weighted normal-equations matrix is then only guaranteed to be positive semidefinite. Worse, the IRLS adjusted dependent covariates become undefined. [11]
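To make these limits concrete, the following sketch (ours, not the thesis implementation) evaluates the logistic function in a form that never exponentiates a positive argument, avoiding the overflow at a discriminant of 709:

```python
import math

def sigmoid_naive(eta):
    # Direct form: math.exp(eta) overflows an IEEE 754 double
    # (raising OverflowError in Python) once eta exceeds ~709.78.
    return math.exp(eta) / (1.0 + math.exp(eta))

def sigmoid_stable(eta):
    # Overflow-safe form: for eta >= 0, rewrite as 1 / (1 + exp(-eta))
    # so the exponential argument is never positive; for eta < 0,
    # exp(eta) <= 1 and the direct form is safe.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

print(sigmoid_stable(800.0))   # 1.0 -- the naive form would overflow here
print(sigmoid_stable(37.0))    # 1.0 -- already indistinguishable from one
print(sigmoid_stable(-745.0))  # a tiny denormalized value, still > 0
```

Even the stable form cannot escape the representational limits themselves: at a discriminant of 37 the result rounds to exactly 1, matching the thresholds quoted above.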

A further problem arises for binary outputs. Because the logistic function approaches but never reaches the output values zero and one, model parameters may be driven unreasonably high or low. This behavior causes scaling problems because not all parameters will necessarily grow together. McIntosh [26] solved a similar problem by removing rows whose Binomial outcomes were saturated at 100%. That technique is not appropriate for Bernoulli outcomes, which are identically one or zero.

We will explore several approaches to solving the stability challenges
inherent in IRLS computations on binary datasets. We will introduce
our stability parameters in the following paragraphs, with in-depth
test results for each presented further below. The stability
parameters are `modelmin`, `modelmax`, `wmargin`, `margin`, `binitmean`, `rrlambda`, `cgwindow` and
`cgdecay`.

The `modelmin` and `modelmax` parameters are used to threshold the
logistic function above and below, guaranteeing that its output will
always be distinguishable from zero and one. The `wmargin` parameter
imposes a similar symmetric thresholding for the weights. The `margin`
parameter shrinks the outputs toward one-half by the specified amount,
allowing them to fall within the range achievable by the logit. Since
non-binary outputs can be interpreted in LR as the observed mean of a
binomial experiment, the `margin` parameter can be viewed as adding
uncertainty to each experiment's outcome. All of these parameters
have the unfortunate side-effect of changing the shape of the
optimization surface. In particular, the `modelmin` and `modelmax`
parameters destroy convexity and flatten the surface into a plane once
the discriminant becomes large.
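A minimal sketch of how these three clamps might act, with illustrative default values (the actual implementation's formulas and defaults may differ):

```python
def clamp_model(mu, modelmin=1e-10, modelmax=1.0 - 1e-10):
    # modelmin/modelmax: threshold the logistic output so it stays
    # distinguishable from exactly zero and one.
    return min(max(mu, modelmin), modelmax)

def clamp_weight(w, wmargin=1e-10):
    # wmargin: keep the IRLS weights mu*(1-mu) away from zero, so the
    # adjusted dependent covariates remain defined.
    return max(w, wmargin)

def shrink_output(y, margin=0.05):
    # margin: move each output toward one-half by the given amount, so
    # binary targets fall inside the open range of the logit.
    # (The exact shrinkage formula here is our assumption.)
    if y < 0.5:
        return min(y + margin, 0.5)
    return max(y - margin, 0.5)
```

Each clamp trades exactness for stability, which is why the text above notes that they reshape the optimization surface.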

Normally the LR parameter vector is initialized to zero
before IRLS iterations begin. McIntosh [26] reports that
initializing the offset component to the mean of the outcomes
eliminated overflow errors during CG-MLE
computations. To test whether this technique affects IRLS we provide
the `binitmean` flag. Also motivated by the literature is the standard
regularization technique of ridge regression, discussed in
Section 3.2. The ridge-regression weight
is set using the `rrlambda` parameter. Larger `rrlambda` values impose
a greater penalty on oversized parameter estimates.
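As a sketch of how these two ideas might enter an IRLS implementation (the function names and defaults are ours; the ridge-penalized update shown is the textbook form, not necessarily the thesis code):

```python
import numpy as np

def init_beta(X, y, binitmean=True):
    # Start from the zero vector; optionally set the offset component
    # to the mean of the outcomes, following McIntosh's technique.
    # (Assumes column 0 of X is the constant offset column.)
    beta = np.zeros(X.shape[1])
    if binitmean:
        beta[0] = y.mean()
    return beta

def irls_step(X, y, beta, rrlambda=0.0):
    # One ridge-regularized IRLS update: solve
    #   (X^T W X + rrlambda * I) beta_new = X^T W z,
    # where z is the adjusted dependent covariate.
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))
    w = mu * (1.0 - mu)
    z = eta + (y - mu) / w
    A = X.T @ (w[:, None] * X) + rrlambda * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ (w * z))
```

With `rrlambda > 0` the linear system stays positive definite even when some weights collapse toward zero, which is one reason ridge regression helps with the stability problems described above.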

All of the previous parameters attempt to control the behavior of the
model and optimizer. The two remaining stability parameters exist to
stop optimization when all else fails. If `cgwindow` consecutive
iterations of CG fail to improve the deviance, CG iterations are
terminated. Similarly, if a CG iteration increases the deviance to
more than the parameter `cgdecay` times the best previous deviance,
then CG iterations are terminated.
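These two fail-safes can be sketched as a single stopping check over the deviance history (this is our reading of the rules above; the thesis wording admits other interpretations, and the default values here are illustrative):

```python
def cg_should_stop(deviances, cgwindow=3, cgdecay=2.0):
    """Illustrative fail-safe checks on the CG deviance history.

    deviances: deviance recorded after each CG iteration so far.
    """
    # Terminate if the last cgwindow iterations all failed to improve
    # on the best deviance seen before them.
    if len(deviances) > cgwindow:
        prior_best = min(deviances[:-cgwindow])
        if min(deviances[-cgwindow:]) >= prior_best:
            return True
    # Terminate if the latest iteration pushed the deviance above
    # cgdecay times the best previous deviance.
    if len(deviances) >= 2 and deviances[-1] > cgdecay * min(deviances[:-1]):
        return True
    return False
```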

Though not evaluated in this section, we cannot avoid mentioning the
termination parameters `lreps` and `cgeps`. A full description of
these will wait until Section 5.2.2. In short, these parameters
control when IRLS quits searching for the optimal LR parameter
vector. The smaller these values are, the more accurate our estimate
of that vector becomes. Smaller values also result in more IRLS and
CG iterations. On the other hand, if `lreps` and `cgeps` are set too
small and we are not using any regularization, then we may find an LR
parameter estimate which allows perfect prediction of the training
data, including any noisy or incorrect outcomes. This will result in
poor prediction performance.
One further note regarding the CG termination parameter `cgeps`: it is
compared to the Euclidean norm of the CG residual vector, as defined
in Section 2.1.2. Because this norm is affected by the dimensionality
of the dataset, we originally scaled `cgeps` by the number of
attributes. This scaling is not optimal, and will be discussed in
detail in Section 5.2.2.
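The dimensionality scaling just described amounts to the following convergence test (an illustrative rendering; the actual termination logic appears in Section 5.2.2):

```python
import math

def cg_converged(residual, cgeps, num_attributes, scale_by_dim=True):
    # Compare the Euclidean norm of the CG residual vector against the
    # termination threshold, optionally scaling cgeps by the number of
    # attributes to compensate for dimensionality.
    norm = math.sqrt(sum(r * r for r in residual))
    eps = cgeps * num_attributes if scale_by_dim else cgeps
    return norm <= eps
```

The drawback is visible in the code: the effective threshold grows linearly with the attribute count, so the same `cgeps` demands much less of the residual on a wide dataset than on a narrow one.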

Tuning all of these stability parameters by hand is unreasonable. The following analysis examines the impact of each on six datasets, ultimately ascertaining which are beneficial and whether reasonable default values can be identified.