Stability Parameter Motivation

The earliest problems we encountered in our implementations were floating point underflow and overflow. Because the datasets that concern us often have ten thousand or more attributes, the discriminant $ \eta = \beta_0 + \beta_1 x_1 + \ldots + \beta_M x_M$ can easily grow large. When computing $ \exp(\eta)$ in the LR expectation function of Equation 4.1, a discriminant greater than 709 is sufficient to overflow IEEE 754 floating point arithmetic, the standard for modern computer hardware. The logistic function $ \mu = \exp(\eta) / (1+\exp(\eta))$ becomes indistinguishable from 1 when the discriminant reaches 37. IEEE 754 denormalized floating point values allow discriminants as low as -745 to be distinguished from zero; however, denormalized values have a shortened mantissa, which compromises accuracy. When $ \mu$ reaches 0 or 1, the weights $ w = \mu (1-\mu)$ become 0, and $ \ensuremath{\mathbf{X}}^T \ensuremath{\mathbf{W}} \ensuremath{\mathbf{X}}$ is only guaranteed to be positive semidefinite rather than positive definite. Worse, the IRLS adjusted dependent covariates $ z = \eta + (y - \mu) / w$ become undefined. [11]
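These limits are easy to reproduce. The following sketch (in Python, on IEEE 754 double precision hardware) demonstrates the overflow, saturation, and underflow thresholds described above:

```python
import math

def logistic(eta):
    # The LR expectation function mu = exp(eta) / (1 + exp(eta)).
    return math.exp(eta) / (1.0 + math.exp(eta))

# exp overflows IEEE 754 double precision once the discriminant exceeds ~709.
try:
    math.exp(710)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)                 # True

# Near eta = 37 the logistic function saturates; by eta = 40 it is exactly 1.0.
print(logistic(40.0) == 1.0)      # True

# Denormalized doubles keep exp(eta) distinguishable from zero down to
# eta = -745, though with a shortened mantissa and reduced accuracy.
print(math.exp(-745) > 0.0)       # True
print(math.exp(-746) == 0.0)      # True: underflows to zero

# Once mu saturates, the IRLS weight w = mu * (1 - mu) vanishes and the
# adjusted dependent covariate z = eta + (y - mu) / w is undefined.
mu = logistic(40.0)
print(mu * (1.0 - mu) == 0.0)     # True
```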

A further problem arises for binary outputs. Because the logistic function approaches but never reaches the output values 0 and 1, the model parameters may be driven unreasonably high or low. This behavior causes scaling problems because not all parameters will necessarily grow together. McIntosh [26] solved a similar problem by removing rows whose Binomial outcomes were saturated at 100%. That technique is not appropriate for Bernoulli outcomes, which are identically one or zero.

We will explore several approaches to solving the stability challenges inherent in IRLS computations on binary datasets. We will introduce our stability parameters in the following paragraphs, with in-depth test results for each presented further below. The stability parameters are modelmin, modelmax, wmargin, margin, binitmean, rrlambda, cgwindow and cgdecay.

The modelmin and modelmax parameters threshold the logistic function above and below, guaranteeing that $ \mu$ is always distinguishable from zero and one. The wmargin parameter imposes a similar symmetric threshold on the weights. The margin parameter shrinks the outputs toward one-half by the specified amount, allowing them to fall within the range achievable by the logistic function. Since non-binary outputs can be interpreted in LR as the observed mean of a binomial experiment, the margin parameter can be viewed as adding uncertainty to each experiment's outcome. All of these parameters have the unfortunate side-effect of changing the shape of the optimization surface. In particular, the modelmin and modelmax parameters destroy convexity and flatten the surface into a plane once $ \ensuremath{\mathbf{\beta}}$ becomes large.
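A minimal sketch of these three safeguards, in Python with NumPy; the default values below are illustrative assumptions, not the settings actually used in our implementation:

```python
import numpy as np

def clamp_mu(mu, modelmin=1e-10, modelmax=1.0 - 1e-10):
    # modelmin/modelmax threshold the logistic output above and below,
    # keeping mu distinguishable from 0 and 1.
    return np.clip(mu, modelmin, modelmax)

def clamp_w(mu, wmargin=1e-10):
    # wmargin imposes a symmetric floor on the IRLS weights
    # w = mu * (1 - mu), so they never vanish entirely.
    return np.maximum(mu * (1.0 - mu), wmargin)

def shrink_y(y, margin=0.01):
    # margin shrinks outputs toward one-half: 0 -> margin, 1 -> 1 - margin,
    # placing them within the range the logistic function can reach.
    return y * (1.0 - 2.0 * margin) + margin

print(shrink_y(np.array([0.0, 1.0]), margin=0.01))   # [0.01 0.99]
```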

Normally the LR parameter vector $ \ensuremath{\mathbf{\beta}}$ is initialized to zero before IRLS iterations begin. McIntosh [26] reports that initializing the offset component $ \beta_0$ to the mean of the outcomes $ \ensuremath{\mathbf{y}}$ eliminated overflow errors during CG-MLE computations. To test whether this technique affects IRLS we provide the binitmean flag. Also motivated by the literature is the standard regularization technique of ridge regression, discussed in Section 3.2. The ridge-regression weight $ \lambda$ is set using the rrlambda parameter. Larger rrlambda values impose a greater penalty on oversized estimates $ \hat{\ensuremath{\mathbf{\beta}}}$.
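Both techniques can be sketched as follows (Python/NumPy; placing the intercept in component 0 and the rrlambda value shown are assumptions for illustration):

```python
import numpy as np

def init_beta(num_params, y, binitmean=False):
    # beta normally starts at zero; with binitmean the offset component
    # beta_0 is set to the mean of the outcomes y, following McIntosh.
    beta = np.zeros(num_params)
    if binitmean:
        beta[0] = y.mean()
    return beta

def ridge_normal_matrix(X, w, rrlambda=10.0):
    # Ridge regression adds rrlambda * I to X^T W X; larger rrlambda
    # values penalize oversized estimates of beta more heavily.
    return X.T @ (w[:, None] * X) + rrlambda * np.eye(X.shape[1])
```

One side benefit of the ridge term: even if every weight $ w$ is zero, the regularized matrix is $ \lambda I$ and remains positive definite.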

All of the previous parameters attempt to control the behavior of the model and optimizer. The two remaining stability parameters exist to stop optimization when all else fails. If cgwindow consecutive iterations of CG fail to improve the deviance, CG iterations are terminated. Similarly if a CG iteration increases the deviance by more than the parameter cgdecay times the best previous deviance, then CG iterations are terminated.
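One possible reading of these two stopping rules, sketched in Python (the default values are illustrative, and the interpretation of the cgdecay comparison is an assumption):

```python
def cg_should_stop(deviances, cgwindow=3, cgdecay=2.0):
    # deviances: deviance after each CG iteration so far, oldest first.
    if len(deviances) < 2:
        return False
    best_prev = min(deviances[:-1])
    # cgdecay test: the latest deviance has grown far past the best
    # previously seen deviance.
    if deviances[-1] > cgdecay * best_prev:
        return True
    # cgwindow test: the last cgwindow iterations all failed to improve
    # on the best deviance achieved before them.
    if len(deviances) > cgwindow:
        if min(deviances[:-cgwindow]) <= min(deviances[-cgwindow:]):
            return True
    return False

print(cg_should_stop([10.0, 9.0, 25.0]))            # True (deviance blew up)
print(cg_should_stop([10.0, 9.0, 9.5, 9.6, 9.7]))   # True (no improvement)
print(cg_should_stop([10.0, 9.0, 8.0, 7.0]))        # False (still improving)
```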

Though not evaluated in this section, we cannot avoid mentioning the termination parameters lreps and cgeps. A full description of these will wait until Section 5.2.2. In short, these parameters control when IRLS quits searching for the optimal LR parameter vector $ \ensuremath{\mathbf{\beta}}$. The smaller these values are, the more accurate our estimate of $ \ensuremath{\mathbf{\beta}}$ becomes; smaller values also result in more IRLS and CG iterations. On the other hand, if lreps and cgeps are set too small and we are not using any regularization, then we may find an LR parameter estimate $ \hat{\ensuremath{\mathbf{\beta}}}$ which allows perfect prediction of the training data, including any noisy or incorrect outcomes. This will result in poor prediction performance. One further note regards the CG termination parameter cgeps: it is compared to the Euclidean norm of the CG residual vector $ \ensuremath{\mathbf{r}}_i$, as defined in Section 2.1.2. Because this norm is affected by the dimensionality of the dataset, we originally scaled cgeps by the number of attributes. This scaling is not optimal and will be discussed in detail in Section 5.2.2.
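The cgeps comparison amounts to the following check (a Python/NumPy sketch of the original attribute-count scaling, not the improved scheme of Section 5.2.2):

```python
import numpy as np

def cg_converged(r, cgeps, num_attributes=1):
    # cgeps is compared to the Euclidean norm of the CG residual r_i.
    # The original implementation scaled cgeps by the number of
    # attributes to offset the norm's growth with dimensionality.
    return np.linalg.norm(r) <= cgeps * num_attributes

print(cg_converged(np.zeros(5), 1e-3, num_attributes=5))   # True
print(cg_converged(np.ones(4), 1e-3, num_attributes=4))    # False
```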

Tuning all of these stability parameters is unreasonable. The following analysis will examine the impact of each of these on six datasets, ultimately ascertaining which are beneficial and whether reasonable default values can be identified.

Copyright 2004 Paul Komarek,