

This page is divided into two sections. The first section holds the dataset table, and the second describes the file formats those datasets use. To save disk space and network bandwidth, datasets on this page are losslessly compressed using the popular bzip2 software. Under any reasonable unix-ish system, the data files can be decompressed with the bzip2 -d or bunzip2 commands.

Note that the filenames do not necessarily match the dataset names. Please pay attention to the filename when downloading, to avoid "losing" the file on your computer.
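If you prefer to decompress from a script rather than the command line, Python's standard-library bz2 module does the same job as bunzip2. A minimal sketch (the function name and paths are just illustrations):

```python
import bz2
import shutil

def decompress_bz2(src_path, dst_path):
    """Decompress a .bz2 file (e.g. a downloaded dataset) to dst_path.
    Equivalent to running bunzip2, but keeps the original file."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```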

Dataset Table

Each entry below gives the dataset name, file format, a description, and the number of rows, attributes, and nonzero entries.

ds1.10 (26,733 rows, 10 attributes, 267,330 nonzero entries)
The ds1.10 dataset is a compressed life sciences dataset. Each row of the original, expanded dataset represented a chemistry or biology experiment, and the output represented the reactivity of the compound observed in the experiment. We ran principal components analysis (PCA) on that dataset to create ds1.10, saving only the top ten principal components. The rightmost attribute is a binarized version of the original output. This dataset seems to trigger strange behavior in the SVMlight and LIBSVM support vector machine software.

ds1.100 (26,733 rows, 100 attributes, 2,673,300 nonzero entries)
The ds1.100 dataset is a compressed life sciences dataset similar to ds1.10, above. The only difference is the number of principal components used: 100. This dataset does not trigger crazy behavior in the SVMlight and LIBSVM support vector machine software, though the score curve is not as smooth in the parameters as one would like it to be.

imdb, spardat format (167,773 rows, 685,569 attributes, 2,442,721 nonzero entries)
This is a link dataset built with permission from the Internet Movie Database (IMDB). Each row is a film or television program. Each attribute represents a person: an actor, director, etc. In a given row, there is a 1 (one) for every person associated with that row (i.e. film or television program), and a 0 (zero) for every person not associated with it. The data file is itself stored in a sparse format, so don't expect a giant CSV matrix. The output is 1 (one) if Mel Blanc, voice of Bugs Bunny and other cartoon characters, was involved in the film or television program. Mel Blanc was chosen as the output because, at the time of compilation, he had appeared in more films or television programs than any other person in the database. Note that Mel Blanc is not among the input attributes.

citeseer, spardat format (181,395 rows, 105,354 attributes, 512,267 nonzero entries)
This is a link dataset built with permission from the CiteSeer web database. Each row is a scientific paper. Each attribute represents an author. In a given row, there is a 1 (one) for every author associated with that row (i.e. paper), and a 0 (zero) for every author not associated with it. The data file is itself stored in a sparse format, so don't expect a giant CSV matrix. The output is 1 (one) if author J. Lee was involved in the paper. We expect that there are several authors in the CiteSeer database with the name J. Lee. J. Lee was chosen as the output because, at the time of compilation, he or she had appeared in more papers than any other person in the database. Note that J. Lee is not among the input attributes.

Reuters-21578 Modified Apte, June format (7,769 rows, 26,299 attributes, 423,025 nonzero entries)
Files: training factors, training activations, testing factors, testing activations.
My version of the Modified Apte training data from the Reuters-21578 corpus, often used in text classification experiments. It was created to duplicate Yiming Yang's version of the same data, but ended up somewhat different. The "factors" file contains the inputs (word occurrences), and the "activations" file contains the outputs (class labels). You should download both files.

Dataset Format Descriptions

CSV (jump to dataset table)
All rows in a CSV dataset file are comma-separated lists of real values. In some cases, the first line has a list of the input and output attribute names, and the second line should be blank. All remaining lines are treated as records. Unless otherwise specified, the usual output attribute is the right-most attribute. The records can have numeric or symbolic values. Integers are treated as symbolic values. Sometimes we will call a whitespace-delimited file a CSV file, since the only difference is the delimiter. Example file:
          a,b,c,d, e
          yes, 0, 1, 0.0, success
          yes, 1, 0, 0.3, failure
          no,  1, 1, 0.0, success
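A small sketch of a loader for this layout, assuming the conventions above (first line is attribute names, integers stay symbolic, only non-integer numerics become reals); the function name is just an illustration:

```python
import csv

def parse_autons_csv(text):
    """Parse CSV text in the layout described above. The first line holds
    attribute names; blank lines are skipped; remaining lines are records.
    Integer-looking cells are kept as strings (symbolic), non-integer
    numerics become floats, and everything else stays symbolic."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    reader = csv.reader(lines)
    names = [c.strip() for c in next(reader)]
    records = []
    for row in reader:
        rec = []
        for cell in row:
            cell = cell.strip()
            try:
                int(cell)            # integers are treated as symbolic
                rec.append(cell)
            except ValueError:
                try:
                    rec.append(float(cell))   # real-valued attribute
                except ValueError:
                    rec.append(cell)          # symbolic value
        records.append(rec)
    return names, records
```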
FDS (jump to dataset table)
FDS is a binary version of CSV files. The first line shows the format name, and the second line shows the attribute names. There is one more line for each attribute, explaining whether the attribute is real or symbolic. If the attribute is symbolic, the attribute's list of symbols is displayed. The number of rows is listed on the line following the attribute descriptions. After this, the rest of the file contains a mostly-binary dump of the data. Example file (based on the CSV example file):
          AUTONFastFileFormat Version 1
          a b c d e
          symbolic  a values: yes no
          symbolic  b values: 0 1
          symbolic  c values: 1 0
          real  d
          symbolic  e values: success failure
          rows = 3
          [mostly-binary data]
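The text header can be parsed without touching the binary payload. A sketch under the header layout described above (the function name and the returned shapes are illustrative, not part of the FDS specification):

```python
def parse_fds_header(lines):
    """Parse the text header lines of an FDS file as described above.
    Returns (attribute_names, specs, n_rows), where specs maps each
    attribute name to ("real",) or ("symbolic", [symbols]).  The
    mostly-binary data after the header is not handled here."""
    assert lines[0].startswith("AUTONFastFileFormat")
    names = lines[1].split()
    specs = {}
    # One description line per attribute, in order.
    for ln in lines[2:2 + len(names)]:
        parts = ln.split()
        if parts[0] == "real":
            specs[parts[1]] = ("real",)
        else:                     # "symbolic NAME values: s1 s2 ..."
            specs[parts[1]] = ("symbolic", parts[3:])
    n_rows = int(lines[2 + len(names)].split("=")[1])
    return names, specs, n_rows
```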
spardat (jump to dataset table)
The spardat format is only capable of representing datasets with binary inputs and real outputs. The format is designed for sparse data, and is inefficient for dense data. Though the output may be a real number, the spardat loader we use binarizes the output with a user-supplied threshold. This format is whitespace-delimited. Each line starts with the real output value, followed by a (whitespace-delimited) list of the attributes which have value 1 (one) for that dataset row. The attributes are listed according to their index, starting from 0 (zero). The dataset is assumed to have as many attributes as necessary to accommodate the highest-numbered attribute that appears in any row. However, there is no requirement that lower-numbered attributes appear anywhere. For compatibility with some software, such as SVMlight, the attribute indices may be followed with ":1". Lines beginning with "#" are ignored. Example file with 8 attributes, mixing the standard attribute index format with the ":1" version:
          # The first line uses the :1 format.
          1.000000 0:1 3:1 7:1
          # The rest of the lines use the standard format.  It is
          # unusual to mix standard and :1 formats in the same file.
          0.000000 1 2 5 6
          1.414214 0
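A sketch of a spardat reader following the rules above: comment lines skipped, optional ":1" suffixes accepted, output binarized against a user-supplied threshold. The function name and default threshold are illustrative, not taken from any particular loader:

```python
def parse_spardat(lines, threshold=0.5):
    """Parse spardat lines as described above.  Each non-comment line is a
    real output value followed by the indices of the attributes equal to 1,
    optionally written with an SVMlight-style ':1' suffix.  Returns
    (rows, n_attributes), where each row is (binarized_label, indices) and
    n_attributes accommodates the highest index seen anywhere."""
    rows = []
    n_attributes = 0
    for ln in lines:
        ln = ln.strip()
        if not ln or ln.startswith("#"):
            continue                      # comments and blank lines
        fields = ln.split()
        label = 1 if float(fields[0]) >= threshold else 0
        indices = [int(f.split(":")[0]) for f in fields[1:]]
        n_attributes = max(n_attributes, max(indices, default=-1) + 1)
        rows.append((label, indices))
    return rows, n_attributes
```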
June (jump to dataset table)
The June format splits the input attributes and the output attributes into two files. The format of the inputs file is identical to the spardat format, except there is no output specified at the beginning of each line. The outputs file is a CSV file, though whitespace-delimited columns are allowed. The first line of the outputs file must contain the attribute names. The second line may be blank. Each remaining line holds one or more output values for the corresponding input line. For algorithms which need only one output value, the software will allow specification of the output column name. Columns used as outputs must contain only the symbols Act (short for Active) and Not_Act. The Act output is generally assumed to represent the positive class in a binary classification. The outputs in a June-formatted dataset are necessarily binary. If a column in the output file is never used as an output column, it can contain anything. The inputs are sometimes called the "factors", and the outputs are sometimes called the "activations". Example input file:
          1 3 7
          1 2 5 6
Example output file, using whitespace for the delimiter, containing two columns (Experiment1 and Experiment2) eligible for use as output columns:
          RowID       Experiment1 Experiment2
          0000:green  Act         Act
          0001:blue   Act         Not_Act
          0002:red    Act         Act
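Putting the two files together, a June loader might look like the sketch below. It assumes the conventions described above (spardat-style factors without a leading output, a whitespace-delimited activations table with a header line); the function name and returned structure are illustrative:

```python
def load_june(factors_text, activations_text, output_column):
    """Load a June-format dataset as described above.  factors_text holds
    spardat-style rows without a leading output value; activations_text is
    a whitespace-delimited table whose first line names the columns.
    Returns (indices, label) pairs, with label 1 for Act, 0 for Not_Act."""
    inputs = []
    for ln in factors_text.splitlines():
        ln = ln.strip()
        if ln:
            inputs.append([int(f.split(":")[0]) for f in ln.split()])
    out_lines = [ln for ln in activations_text.splitlines() if ln.strip()]
    col = out_lines[0].split().index(output_column)
    labels = [1 if ln.split()[col] == "Act" else 0 for ln in out_lines[1:]]
    return list(zip(inputs, labels))
```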

Created by Paul Komarek.