# Logistic regression for binary classification

Logistic regression (LR) is a statistical model suited to probabilistic binary classification, though it is not limited to that use. Computing the best logistic model for a given dataset can be difficult, and is the subject of continuing research. Because LR is a mature and well-known model, we do not describe it in detail on this webpage. This tutorial appears reasonable, and more tutorials can be found via Google. The rest of this webpage assumes some knowledge of logistic regression.

## LR-TRIRLS (our software)

LR-TRIRLS stands for Logistic Regression with Truncated Regularized Iteratively Re-weighted Least Squares. This is our contribution to LR computation. The most up-to-date software is the GPL'd source and binary release. More links to related software and research are included below.

LR-TRIRLS uses IRLS to learn the LR parameters, with some modifications. IRLS is a quasi-Newton method used for fitting Generalized Linear Models (GLiMs) to data. It is an iterative method, solving a weighted least squares (WLS) linear regression problem on each iteration. These WLS subproblems require solving a linear system of equations, which is slow, unstable, and prone to overfitting when done naively and exactly (see our AISTATS 2003 paper, linked below). We approximate the solution to the WLS subproblems using linear conjugate gradient (CG). This provides a significant acceleration, improves stability, and reduces overfitting through early termination. To further reduce overfitting, ridge regularization is used when solving the WLS subproblems.
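As a concrete illustration, the loop described above can be sketched in a few lines of Python. This is a minimal sketch under simplified assumptions, not the actual LR-TRIRLS implementation; the function name, fixed iteration counts, and absence of convergence tests are all simplifications.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def lr_trirls(X, y, ridge=1.0, iters=10, cg_iters=30):
    """Truncated regularized IRLS for logistic regression (sketch).

    Each IRLS step solves the ridge-regularized WLS system
    (X'WX + ridge*I) d = X'(y - mu) - ridge*beta approximately,
    using only a few iterations of linear conjugate gradient."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # current probabilities
        w = mu * (1.0 - mu)                      # IRLS weights
        # Matrix-free Hessian-vector product: never form X'WX explicitly.
        A = LinearOperator((p, p),
                           matvec=lambda v: X.T @ (w * (X @ v)) + ridge * v)
        g = X.T @ (y - mu) - ridge * beta        # regularized gradient
        d, _ = cg(A, g, maxiter=cg_iters)        # truncated CG solve
        beta = beta + d
    return beta
```

The matrix-free CG solve is what makes the approach attractive for large sparse data: only products with X and X' are needed, and stopping CG early provides both speed and implicit regularization.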

Note that applying linear CG to the WLS subproblems is very different from using nonlinear CG to maximize the LR likelihood. Nonlinear CG is more complicated to understand, harder to code, and requires heuristic line searches, direction updates, and restarts. Empirically, we have found that LR-TRIRLS performs better than nonlinear CG applied to the LR likelihood. See the thesis or technical report linked below for details.

When IRLS is applied to a GLiM with a canonical link function, it is equivalent to Newton's method, since the expected value of the Hessian equals the Hessian itself. LR is one example of a GLiM with a canonical link. In numerical analysis, a Newton algorithm which only approximates each Newton update is called a truncated Newton method, and convergence guarantees are available for such methods (see the tech report linked below). In IRLS, each WLS subproblem corresponds to a Newton update. Because we approximate the WLS solutions using linear CG, we refer to this combination as Truncated IRLS.

Logistic regression means many things to many people. I have made a list of important points about LR within the scope of our research.

• You can use LR to model the relation between many input variables and a binary output variable. The LR model represents the probability that a data point belongs to the "positive" class (an arbitrary label for one of the two classes in the context of binary classification).
• The time required to learn the best LR model for your data can vary greatly depending on the software used, and whether you need a simple binary classifier or an explanatory model.
• Unless you carefully choose the attributes (input variables) in your dataset, the resulting LR model cannot be trusted to properly explain the relation between those attributes and the output.
• Our research focuses on large binary classification tasks, where learning an explanatory LR model is not possible.
• Our fast LR software does not compute the inverse covariance matrix, and hence we cannot compute variances on our estimates of the LR model parameters. However, bootstrapping can provide reasonable estimates of parameter variance, yet still cost less time than using the LR fitting procedures in many statistical software packages.
• There is more than one way to extend LR from binary to multiclass classification. Furthermore, with fast LR software it is often reasonable to apply LR to large multiclass problems.
• LR is a linear classifier, in that it produces a linear decision boundary, but this does not seem to be a problem for high-dimensional classification. The best explanation is that the average distance between points typically increases with dimensionality, leaving room for good linear decision boundaries.
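The bootstrapping idea mentioned in the list can be sketched as follows: resample rows with replacement, refit the model, and take the empirical variance of each coefficient across the refits. The helper fit below is a plain ridge-regularized Newton/IRLS fit written only to make the sketch self-contained; it is not our software.

```python
import numpy as np

def fit_lr(X, y, ridge=1e-2, iters=25):
    """Plain ridge-regularized Newton/IRLS fit (helper for the sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        H = X.T @ (X * (mu * (1 - mu))[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (y - mu) - ridge * beta)
    return beta

def bootstrap_param_variance(X, y, n_boot=50, seed=0):
    """Empirical variance of each LR coefficient over bootstrap refits:
    resample the rows with replacement, refit, take the variance."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = [fit_lr(X[idx], y[idx])
             for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return np.var(np.array(coefs), axis=0)
```

Because each refit is independent, the total cost is n_boot fast fits, which can still undercut one slow fit-plus-covariance computation in a traditional statistics package.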

Software:

1. GPL'd source version, with pre-compiled GNU/Linux and Windows x86 binaries: [personal page] [Auton page]
2. Older binary-only version, with GUI, for GNU/Linux and Windows on x86, for sparse binary data only: [Auton page]
3. Older binary-only version, with GUI, for GNU/Linux and Windows on x86, for dense real-valued data only: [Auton page]

Papers:

1. Making Logistic Regression A Core Data Mining Tool: A Practical Investigation of Accuracy, Speed, and Simplicity [Auton page]
2. Logistic Regression for Data Mining and High-Dimensional Classification (doctoral thesis) [Auton page] [personal page] [CMU RI tech report page]
3. Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs [Auton page]

Invited Talks:

1. Logistic regression for fast, accurate, and parameter free data mining, 2005 at Google, Inc. [personal page]
2. Autonomous Fast Classifiers for Pharmaceutical Data Sets, 2004 Midwest Biopharmaceutical Workshop (MBSW) and also 2004 at Applied Biosystems, Inc. [personal page]

## Logistic regression extensions

Multiclass extensions

A multiclass problem is one with more than two output classes. LR returns a single value P, the probability the data point belongs to the positive class. The probability of Not(positive) is 1-P. When you have three classes, this is not enough information to decide which output class is the most likely.

The simplest way to extend a probabilistic binary classifier to multiclass problems is through voting. The binary classifier is run once for each output class, and the probabilities recorded for each dataset record. After all of the probabilities are collected, then for each row you assign the output class which has the highest probability. Suppose b_j is the LR parameter vector learned for class j. Let n_ij be the dot product between b_j and x_i, where x_i is the ith row from the dataset. In voting, we calculate the ith row's probability of belonging to class j as p_ij = exp(n_ij) / (1 + exp(n_ij)), and the predicted class for x_i is argmax_j p_ij.

For logistic regression, a more sophisticated multiclass adaptation is available. As with voting, one model is learned for each output class. However, instead of comparing the usual probabilities for each model, a modified probability is calculated which combines all the models. Using the terminology from the previous paragraph, and assuming we have C output classes, p_ij = exp(n_ij) / sum(exp(n_ik), k=1..C). Again, the class for x_i is predicted to be argmax_j p_ij.
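Both prediction rules can be sketched with NumPy. Here B is a hypothetical matrix whose jth row is the per-class parameter vector b_j.

```python
import numpy as np

def predict_voting(B, X):
    """Voting: score each row with every per-class LR model and take
    the class whose sigmoid probability is largest."""
    N = X @ B.T                             # n_ij = <b_j, x_i>
    P = 1.0 / (1.0 + np.exp(-N))            # independent per-class probs
    return np.argmax(P, axis=1)

def predict_combined(B, X):
    """Combined variant: normalize exp(n_ij) across the classes."""
    N = X @ B.T
    E = np.exp(N - N.max(axis=1, keepdims=True))   # stabilized exponentials
    return np.argmax(E / E.sum(axis=1, keepdims=True), axis=1)
```

Note that within a row both rules are monotone transforms of n_ij, so under an argmax decision they pick the same class; they differ in the probabilities they report, which matters when ranking or thresholding predictions.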

Bootstrapping to compute parameter variance

[not yet written, search on "bootstrap variance"]

Row-similarity

[not yet written, think all pairs of rows]

## Logistic regression applications

1. Collaborative filtering (link completion)
2. Temporal friend finder
3. Alias detection (MNOP)
4. Activity from demographics and links
5. CosmoQuiz
6. Video Segmentation

When a binary classifier can handle millions of attributes and records, it can be used in surprising places. For instance, a discrete time series can be exploded into variables encapsulating the termination state and several preceding states, or whether certain transitions occurred. If you can find a way to featurize your binary classification problem, you can apply a fast binary classifier to it.
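A minimal sketch of such an explosion follows; the feature names and the choice of indicators (final state, lagged states, observed transitions) are illustrative assumptions, not a fixed recipe.

```python
def featurize_series(series, k=2):
    """Explode a discrete time series into binary features (sketch):
    indicators for the final state, for each of the k preceding
    states, and for which state transitions occurred anywhere."""
    feats = {("final", series[-1])}
    for lag in range(1, min(k, len(series) - 1) + 1):
        feats.add(("lag", lag, series[-1 - lag]))
    for a, b in zip(series, series[1:]):
        feats.add(("transition", a, b))
    return feats
```

Each returned tuple becomes one sparse binary column; a row has a 1 in that column exactly when its series produced the feature.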

Binary classifiers can be extended to multiple output classes in a generic way (voting). Logistic regression can also be modified in a multiclass-aware way for multiple output classes.

If a classifier is fast enough, and the data has only thousands of records, you can afford to compute similarity of records by creating features for every pair of rows.

Probabilistic classifiers, such as logistic regression, can be used for more than just classification. Because the output predictions are probabilities, they can be ordered. This feature, in conjunction with multiclass extensions, can be used for collaborative filtering (see below).

I have made an informal list of interesting LR applications below. We have published papers on some of these applications, but not all of them. Nearly all of them have software available, through a webpage or a special request by email. To download software from links that say [Auton page], you may be asked to register for a username and password. Most of the applications come from the Auton lab and its members. When multiple software packages or research papers are listed, the first item listed is probably the easiest to read, download, or use (my opinion).

Collaborative filtering

Collaborative filtering is exemplified by Amazon.com's recommender system. Statements such as "people who bought cereal also bought milk" are typical results of collaborative filtering systems. It is helpful to think of this task backwards. Start with a complete collection of items (e.g. a shopping cart at checkout), then remove one item. The goal of collaborative filtering is to figure out which item was removed from the shopping cart. Alternatively, collaborative filters rank all items not in the collection (e.g. foods not in the shopping cart) in decreasing order of probability of belonging in the cart. The recommended item is the one with the highest probability of being in the final collection, given the current items in the collection.

Thinking of collaborative filtering backwards makes it clear that it is the same as the link deletion task, which we describe further below. We will defer our explanation of how to use LR for collaborative filtering until the link deletion task section.

Software:

Can be arranged via email

Papers:

See the link completion notes below; that task is equivalent to collaborative filtering.

1. Link deletion is a very simple link analysis task. Assume that we are given a training dataset in which a particular entity E has been removed from any links that contained it. The training data's outputs indicate whether or not E was removed from each link. The goal is to predict, for each link in the testing data, whether entity E should have been present.

To use LR, we convert the link data into sparse binary training and testing datasets. Number all of the link entities in the training data E0,...,Ec, except for the deleted entity E. The sparse training dataset has one row for each training link; these rows are binary vectors with a 1 in the kth position if Ek is in the corresponding training link. We can train LR on this new dataset, and the resulting model tells us the probability that entity E was deleted from a link. The testing data is converted to a sparse binary dataset in exactly the same way, so that we can apply our LR model directly to the testing links.
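The dataset construction can be sketched as below; the function name and the row representation (sorted lists of nonzero column indices) are illustrative choices.

```python
def link_deletion_dataset(links, target):
    """Build the sparse binary dataset for link deletion (sketch).

    Each link is a set of entities. The row for a link records the
    column indices of the non-target entities it contains; the output
    is whether the target entity appeared in the original link."""
    entities = sorted({e for link in links for e in link if e != target})
    index = {e: k for k, e in enumerate(entities)}   # entity -> column
    rows, outputs = [], []
    for link in links:
        rows.append(sorted(index[e] for e in link if e != target))
        outputs.append(int(target in link))
    return rows, outputs, index
```

The same `index` mapping must be reused when converting the testing links, so that training and testing columns line up.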

2. Link completion is a task of identifying which entities are missing from a given link, based on previous observations of complete links. It can be thought of as a multiclass extension to link deletion. It is equivalent to collaborative filtering, described above. Assume we are given training data with complete links, and testing data with incomplete links.

To use LR for link completion, we create several sparse binary training datasets, each with a different output. Each of these datasets is the same as the link deletion dataset we described above, where one entity has been moved from the inputs to the output. Thus there is one sparse binary dataset Dj for each entity Ej. We can learn an LR model b_j for each dataset Dj. For each row in the testing data, and for each entity Ej not in this row, we use the LR model b_j to predict the probability that Ej is missing from the link.
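One way to sketch this is with plug-in callables; `fit` and `predict` below are hypothetical stand-ins for any probabilistic binary classifier such as LR.

```python
def link_completion_scores(train_links, test_link, entities, fit, predict):
    """Link completion sketch: one link-deletion model per entity.

    fit(rows, outputs) trains a model; predict(model, row) returns the
    probability that the model's entity belongs in the row's link.
    Both are hypothetical plug-in callables."""
    models = {}
    for e in entities:
        # Dataset D_e: entity e moved from the inputs to the output.
        rows = [sorted(x for x in link if x != e) for link in train_links]
        outputs = [int(e in link) for link in train_links]
        models[e] = fit(rows, outputs)
    # Score every entity absent from the (incomplete) test link.
    return {e: predict(models[e], sorted(test_link - {e}))
            for e in entities if e not in test_link}
```

Ranking the returned scores in decreasing order yields the collaborative-filtering recommendation list described earlier.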

Software:

Can be arranged via email

Papers:

1. Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion (equivalent to collaborative filtering) [Auton page]
2. This simple link analysis task is covered in our LR-TRIRLS papers, linked above. It is best described in my thesis, and briefly described in a recent tech report.
3. See software and links for TFF, MNOP, and AFDL, below.

Temporal friend finder (TFF): an example of using time series with LR

The temporal friend finder (TFF) algorithm falls under the category of link analysis. Suppose we are given a dataset where each link is an academic paper from the year 2000, and the entities are the authors of the paper. Suppose we also have such datasets for 2002, 2003, and 2004. We want to use all of these dated links to determine the probability that entity Ei will publish a paper with entity Ej in the year 2005.

To use LR for TFF, we capture the temporal aspect of our data by creating features such as

• Ei published one paper with Ej 1 time unit ago
• Ei published one paper with Ej 2 time units ago
• Ei published two papers with Ej 1 time unit ago
• Ei published two papers with Ej 2 time units ago
• Ei published one paper with someone that published with Ej 1 time unit ago
• Ei published one paper with someone that published with Ej 2 time units ago
• Ei published two papers with someone that published with Ej 1 time unit ago
• Ei published two papers with someone that published with Ej 2 time units ago
• ...

Because our LR code can handle data with millions of attributes reasonably quickly, we can afford to make such extensive featurizations of the temporal link data. All that is left is to create an output column indicating whether or not Ei published with Ej in 2005 for training the LR classifier. This can be repeated for all pairs of authors, if desired.
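A sketch of this featurization follows; the feature names, and the choice to cap counts at two (matching the "one paper"/"two papers" indicators in the list above), are illustrative assumptions.

```python
def tff_features(papers_by_year, ei, ej, now):
    """Binary TFF features for the author pair (Ei, Ej) (sketch).

    papers_by_year maps a year to a list of author sets. Emits
    indicators for 'Ei co-published with Ej, count c, lag t' and for
    'Ei published with a co-author of Ej, count c, lag t'."""
    feats = set()
    for year, papers in papers_by_year.items():
        lag = now - year
        if lag <= 0:
            continue  # only past years contribute features
        direct = sum(1 for p in papers if ei in p and ej in p)
        if direct:
            feats.add(("copub", min(direct, 2), lag))
        coauthors = {a for p in papers if ej in p for a in p} - {ej}
        indirect = sum(1 for p in papers
                       if ei in p and ej not in p and p & (coauthors - {ei}))
        if indirect:
            feats.add(("copub_with_coauthor", min(indirect, 2), lag))
    return feats
```

As in the other featurizations, each tuple becomes one sparse binary column; the output column for training records whether the pair actually co-published in the target year.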

Software:

Can be arranged via email

Papers:

1. TFF was covered briefly in my 2005 invited talk at Google, Inc. The primary link for these slides is in the LR-TRIRLS papers section, above.
2. Detailed slides may be available through email

Alias detection (MNOP)

Some datasets use multiple names to represent a single entity. For example, some cake recipes might suggest decorating with sprinkles. Other cake recipes might suggest decorating with jimmies. However, sprinkles and jimmies are exactly the same. When analyzing data, this is called "aliasing" and can cause problems.

There are many methods of detecting aliases in a dataset. The "Many Names, One Person" (MNOP) project from Paul Hsiung and Andrew Moore uses logistic regression for this task. Software, documentation, and research papers are available. See the links listed earlier on this webpage.

In short, MNOP computes many orthographic measures between possible aliases, such as string edit distance, as well as several semantic measures such as co-occurrence with a third term. A binary classifier is used to make predictions based on these measures, and LR was found empirically to work better than other classifiers.

Software:

1. Binary-only version, with GUI, for GNU/Linux and Windows on x86: [Auton Page]

Papers:

1. Alias Detection in Link Data Sets (conference paper) [Auton page]
2. Alias Detection in Link Data Sets (Paul Hsiung's master's thesis) [Auton page]

Activity from demographics and links: an example of graph featurization for LR

In this task, we are trying to identify "active" entities from graph-like link information. Unlike TFF, there is no time component, but we now have demographic information besides entity links. A graph is created from the entity links, and a variety of graph properties are converted to binary features. For example, there might be a feature "The entity in question is two edges away from an active entity". Finally, we might also know that E0 lives in Walla Walla, WA, while E1 lives in Zelienople, PA. All of this information is put into a sparse binary dataset, just as with TFF.

As an example, suppose we have a publication dataset, where entities are authors, and for each author we know their birth date, hair color, and favorite food. We define an active author as one that writes about DNA sequencing, and we know a few authors that are active. We want to identify which other authors might also write about DNA sequencing. We will create a new dataset with one row for each entity. That row will include binary versions of the birth date, hair color, and favorite food. For example, there will be 10 food attributes named "favorite food is ice cream", "favorite food is asparagus", etc. Additionally, there will be binary graph features such as "this entity is one step away from an active entity". More features can also be induced. We can use LR to learn a model that transforms this big collection of features into a probability that an entity is active.
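This featurization can be sketched as follows, using breadth-first search for the hop-distance graph features; the function name, feature names, and demographic encoding are illustrative assumptions.

```python
from collections import deque

def afdl_features(entity, demographics, edges, active, max_hops=2):
    """Binary features for activity prediction (sketch): one indicator
    per demographic (field, value) pair, plus 'within k hops of an
    active entity' graph features computed by breadth-first search."""
    feats = {(field, value) for field, value in demographics[entity].items()}
    # BFS over the link graph to find hop distances from the entity.
    dist = {entity: 0}
    frontier = deque([entity])
    while frontier:
        u = frontier.popleft()
        if dist[u] >= max_hops:
            continue
        for v in edges.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    for k in range(1, max_hops + 1):
        if any(e in dist and dist[e] <= k for e in active):
            feats.add(("active_within_hops", k))
    return feats
```

Running this for every entity yields one sparse binary row per entity, with the known active entities providing the training outputs.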

Software:

1. Binary-only version, with GUI, for GNU/Linux and Windows on x86: [Auton Page]

Papers:

1. Program documentation (registration not required) [Auton page]
2. AFDL was covered briefly in my 2005 invited talk at Google, Inc. The primary link for these slides is in the LR-TRIRLS papers section, above.
3. Detailed slides may be available through email

CosmoQuiz

[description not yet written]

Software:

Can be arranged via email

Papers:

A research paper is in progress. Until it is finished, please see these slides for the 2005 invited talk at Google, Inc.

Video segmentation
[not written yet]