Logistic regression (LR) is a statistical model suitable for probabilistic binary classification, though it is not limited to this use. Computing the best logistic model for a given dataset can be difficult, and is the subject of continuing research. Because LR is a mature and well-known model, we will not describe it in detail on this webpage. This tutorial appears reasonable, and more tutorials can be found via google. This page is organized as follows:
LR-TRIRLS stands for Logistic Regression with Truncated Regularized Iteratively Re-weighted Least Squares. This is our contribution to LR computation. The most up-to-date software is the GPL'd source and binary release. More links to related software and research are included below.
LR-TRIRLS uses IRLS to learn the LR parameters, with some modifications. IRLS is a quasi-Newton method used for fitting Generalized Linear Models (GLiMs) to data. It is an iterative method, solving a weighted least squares (WLS) linear regression problem on each iteration. These WLS subproblems require solving a linear system of equations, which done naively and exactly is slow, unstable, and prone to overfitting (see our AISTAT 2003 paper, linked above). We approximate the solution to the WLS subproblems using linear conjugate gradient (CG). This provides a significant acceleration, improves stability, and reduces overfitting through early termination. To further reduce overfitting, ridge regularization is used when solving the WLS subproblems.
Note that applying linear CG to the WLS subproblems is very different from using nonlinear CG to maximize the LR likelihood. Nonlinear CG is more complicated to understand, harder to code, and requires heuristic line searches, direction updates, and restarts. Empirically, we have found that LR-TRIRLS performs better than nonlinear CG applied to the LR likelihood. See the thesis or technical report linked above for details.
When IRLS is applied to a GLiM using a canonical link function, then IRLS is equivalent to Newton's method since the expected value of the Hessian is the Hessian itself. LR is one example of a GLiM with a canonical link. In numerical analysis, a Newton algorithm which approximates the Newton update is called a truncated Newton method. Convergence guarantees are available for truncated Newton methods (see the tech report linked above). In IRLS, the WLS subproblems are comparable to a Newton update. Because we approximate the WLS solutions using linear CG, we refer to this combination as Truncated IRLS.
Logistic regression means many things to many people. I have made a list of important points about LR in the scope of our research.
A multiclass problem is one with more than two output classes. LR returns a single value P, the probability the data point belongs to the positive class. The probability of Not(positive) is 1-P. When you have three classes, this is not enough information to decide which output class is the most likely.
The simplest way to extend a probabilistic binary
classifier to multiclass problems is though
For logistic regression, a more sophisticated multiclass adaptation is avilable. As with voting, one model is learned for each output class. However, instead of comparing the usual probabilities for each model, a modified probability is calculated which combines all the models. Using the terminology from the previous paragraph, and assuming we have C output classes, p_ij = n_ij / sum(n_ik, k=1..C). Again, the class for x_i is predicted to be argmax_j p_ij.
When a binary classifier can handle millions of attributes and records, it can be used in surprising places. For instance, a discrete time series can be exploded into variables encapsulating the termination state and several preceding states, or whether certain transitions occurred. If you can find a way to featurize your binary classification problem, you can apply a fast binary classifier to it.
Binary classifiers can also be extended in a generic way to multiple output classes. Besides the generic way (voting), logistic regression can be modified in a multiclass-aware way for multiple output classes.
If a classifier is fast enough, and the data has only thousands of records, you can afford to compute similarity of records by creation
Probabilistic classifiers, such as logistic regression, can be used for more than just classification. Because the output predictions are probabilities, they can be ordered. This feature, in conjunction with multiclass extensions, can be used for collaborative filtering (see below).
I have made an informal list of interesting LR applications below. We have published papers on some of these applications, but not all of them. Nearly all of them have software available, through a webpage or a special request by email. To download software from links that say [Auton page], you may be asked to register for a username and password. Most of the applications come from the Auton lab and its members. When multiple software packages or research papers are listed, the first item listed is probably the easiest to read, download, or use (my opinion).
Collaborative filtering is exemplified by Amazon.com's recommender system. Statements such as "people who bought cereal also bought milk", are typical results of collaborative filtering systems. It is helpful to think of this task backwards. Start with a complete collection of items (e.g. shopping cart at checkout), then remove one item. The goal of collaborative filtering is figure out which item was removed from the shopping cart. Alternatively, collaborative filters rank all items not in the collection (e.g. foods not in the shopping cart) in decreasing order of probability of belonging in the cart. The recommended item is that which has highest probability of being in the final collection, given the current items in the collection.
Thinking of collaborative filtering backwards makes it clear that it is the same as the link deletion task, which we describe further below. We will defer our explanation of how to use LR for collaborative filtering until the link deletion task section.
Can be arranged via email
Link deletion is a very simple link task. Assume that we are given a training dataset in which a particular entity E is removed from any links that contained it. The training data's outputs indicates whether or not E was removed from each link. The goal is to predict whether entity E should have been present for each link in the testing data.
To use LR, we need will convert the link data into sparse binary training and testing datasets. Number all of the link entities in the training data E0,...,Ec, except for the deleted entity E. The sparse training dataset has one row for each training link, and these rows are binary vectors which is 1 in the kth position if Ek is in the corresponding training link. We can train LR on this new dataset, and the resulting model will tell us the probability that entity E was deleted from a link. The testing data is converted in the same way to a sparse binary dataset in exactly the same way as the training data, so that we can apply our LR model directly to the testing links.
Link completion is a task of identifying which entities are missing from a given link, based on previous observations of complete links. It can be thought of as a multiclass extension to link deletion. It is equivalent to collaborative filtering, described above. Assume we are given training data with complete links, and testing data with incomplete links.
To use LR for link completion, we create several sparse binary training datasets, each with a different output. Each of these datasets is the same as the link deletion dataset we described above, where one entity has been moved from the inputs to the output. Thus there is one sparse binary dataset Dj for each entity Ej. We can learn an LR model b_j for each dataset Dj. For each row in the testing data, and for each entity Ej not in this row, we use the LR model b_j to predict the probability that Ej is missing from the link.
Can be arranged via email
The temporal friend finder (TFF) algorithm falls under the category of link analysis. Suppose we are given a dataset where each link is an academic paper from the year 2000, and the entities are the authors of the paper. Suppose we also have such datasets for 2002, 2003, and 2004. We want to use all of these dated links to determine the probability that entity Ei will publish a paper with entity Ej in the year 2005.
To use LR for TFF, we capture the temporal aspect of our data by creating features such as
Because our LR code can handle data with millions of attributes reasonably quickly, we can afford to make such extensive featurizations of the temporal link data. All that is left is to create an output column indicating whether or not Ei published with Ej in 2005 for training the LR classifier. This can be repeated for all pairs of authors, if desired.
Can be arranged via email
Some datasets use multiple names to represent a single
entity. For example, some cake recipes might suggest decorating
There are many methods of detecting aliases in a dataset. The "Many Names, One Person" (MNOP) project from Paul Hsiung and Andrew Moore uses logistic regression for this task. Software, documentation, and research papers are available. See the links listed earlier on this webpage.
In short, MNOP computes many orthographic measures between possible aliases, such as string edit distance, as well as several semantic measures such as co-occurance with a third term. A binary classifier is used to make predictions based on these measures, and LR was found emprically to work better than other classifiers.
In this task, we are trying to identify "active" entities from graph-like link information. Unlike TFF, there is no time component, but we now have demographic information besides entity links. A graph is created from the entity links, and a variety of graph properties are converted to binary features. For example, there might be a feature "The entity in question is two edges away from an active entity". Finally, we might also know that E0 lives in Walla Walla, WA, while E1 lives in Zelinople, PA. All of this information is put into a sparse binary dataset, just as with TFF.
As an example, suppose we have a publication dataset, where entities are authors, and for each author we know their birth date, hair color, and favorite food. We define an active author is one that writes about DNA sequencing, and we know a few authors that are active. We want to identify which other authors might also write about DNA sequencing. We will create a new dataset with one row for each entity. That row will include binary versions of the birth date, hair color, and favorite food. For example, there will be 10 food attributes named "favorite food is ice cream", "favorite food is asparagus", etc. Additionally, there will be binary graph features such as "this entity is one step away from an active entity". More features can also be induced. We can use LR to learn a model that transforms this big collection of features into a probability that an entity is active.
[description not yet written]
Can be arranged via email
A research paper is in progress. Until it is finished, please see these slides for the 2005 invited talk at Google, Inc.