| Academics | Papers | Data | Resume |
This page is divided into two sections. The first holds the
dataset table, and the second describes the file formats the
datasets use. To save disk space and network bandwidth, the
datasets on this page are losslessly compressed with bzip2.
On any reasonable Unix-like system, the data files can be
decompressed with the bzip2 -d or bunzip2 commands.
Note that the filenames do not necessarily match the dataset names. Please pay attention to the filename when downloading, to avoid "losing" the file on your computer.
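Where the bzip2 command-line tools are not available, the files can also be decompressed from Python's standard library. The following is a minimal sketch, not part of the original distribution; the function name `decompress_bz2` and the file name in the usage note are hypothetical.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dst=None):
    """Decompress a .bz2 file and keep the compressed original.

    If dst is not given, the output name is the source name with its
    trailing ".bz2" removed (hypothetical convention for this sketch).
    """
    src = Path(src)
    if dst is None:
        if src.suffix != ".bz2":
            raise ValueError("cannot derive an output name from %s" % src)
        dst = src.with_suffix("")
    dst = Path(dst)
    # Stream the data so that large datasets are never held fully in memory.
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `decompress_bz2("example.csv.bz2")` would write `example.csv` next to the compressed file.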
| Dataset | Format | Notes | Rows | Atts | Nonzero |
|---|---|---|---|---|---|
| ds1.10 ds1.10 | CSV FDS | The ds1.10 dataset is a compressed life sciences dataset. Each row of the original, expanded dataset represented a chemistry or biology experiment, and the output represented the reactivity of the compound observed in the experiment. We ran principal components analysis (PCA) on that dataset to create ds1.10, saving only the top ten principal components. The rightmost attribute is a binarized version of the original output. This dataset seems to trigger strange behavior in the SVMlight and LIBSVM support vector machine software. | 26,733 | 10 | 267,330 |
| ds1.100 ds1.100 | CSV FDS | The ds1.100 dataset is a compressed life sciences dataset similar to ds1.10, above. The only difference is the number of principal components kept, 100. This dataset does not trigger strange behavior in the SVMlight and LIBSVM support vector machine software, though the score curve is not as smooth in the parameters as one would like. | 26,733 | 100 | 2,673,300 |
| imdb | spardat | This is a link dataset built with permission from the Internet Movie Database (IMDB). Each row is a film or television program. Each attribute represents a person (an actor, director, etc.). In a given row, there is a 1 (one) for every person associated with that row (i.e., film or television program), and a 0 (zero) for every person not associated with that row. The data file is itself stored in a sparse format, so don't expect a giant CSV matrix. The output is 1 (one) if Mel Blanc, voice of Bugs Bunny and other cartoon characters, was involved in the film or television program. Mel Blanc was chosen as the output because, at the time of compilation, he appeared in more films or television programs than any other person in the database. Note that Mel Blanc is not among the input attributes. | 167,773 | 685,569 | 2,442,721 |
| citeseer | spardat | This is a link dataset built with permission from the CiteSeer web database. Each row is a scientific paper. Each attribute represents an author. In a given row, there is a 1 (one) for every author associated with that row (i.e., paper), and a 0 (zero) for every author not associated with that row. The data file is itself stored in a sparse format, so don't expect a giant CSV matrix. The output is 1 (one) if author J. Lee was involved in the paper. We expect that there are several authors in the CiteSeer database with the name J. Lee. J. Lee was chosen as the output because, at the time of compilation, he or she appeared in more papers than any other person in the database. Note that J. Lee is not among the input attributes. | 181,395 | 105,354 | 512,267 |
| modapte: training factors, training activations, testing factors, testing activations | June | My version of the Modified Apte training data from the Reuters-21578 corpus, often used in text classification experiments. It was created to duplicate Yiming Yang's version of the same data, but ended up somewhat different. The "factors" file contains the inputs (word occurrences), and the "activations" file contains the outputs (class labels). You should download both files. | 7,769 | 26,299 | 423,025 |
a,b,c,d, e
yes, 0, 1, 0.0, success
yes, 1, 0, 0.3, failure
no, 1, 1, 0.0, success
...
AUTONFastFileFormat Version 1
a b c d e
symbolic a values: yes no
symbolic b values: 0 1
real c
symbolic d values: success failure
rows = 4539
[mostly-binary data]
...
# The first line uses the :1 format.
1.000000 0:1 3:1 7:1
# The rest of the lines use the standard format. It is
# unusual to mix standard and :1 formats in the same file.
0.000000 1 2 5 6
1.414214 0
...
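The sparse sample above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not the author's own reader: it assumes the first field of each line is the row's output value, that the remaining fields are zero-based indices of the attributes equal to 1 (written either bare or with a `:1` suffix), and that lines starting with `#` are comments, as in the sample. The function name `parse_spardat_line` is hypothetical.

```python
def parse_spardat_line(line):
    """Parse one line of the sparse format shown above.

    Returns (output, indices), or None for blank and comment lines.
    Assumption: every listed attribute has value 1, so a suffix other
    than ":1" is rejected rather than guessed at.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    output = float(fields[0])
    indices = []
    for token in fields[1:]:
        index, sep, value = token.partition(":")
        if sep and value != "1":
            raise ValueError("unexpected attribute value in %r" % token)
        indices.append(int(index))
    return output, indices


sample = [
    "# The first line uses the :1 format.",
    "1.000000 0:1 3:1 7:1",
    "0.000000 1 2 5 6",
    "1.414214 0",
]
# Keep only the data rows; comment lines parse to None.
records = [r for r in map(parse_spardat_line, sample) if r is not None]
```

Each record pairs an output value with the list of nonzero attribute indices, so a row never needs more storage than its number of nonzero entries.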
The two output values are Act (short for Active) and Not_Act.
The Act output is generally assumed to represent
the positive class in a binary classification. The outputs in
a June-formatted dataset are necessarily binary. If a column
in the output file is never used as an output column, it can
contain anything. The inputs are sometimes called the
"factors", and the outputs are sometimes called the
"activations". Example input file:
1 3 7
1 2 5 6
0
...
Example output file, using whitespace for the delimiter, containing
two columns (Experiment1 and Experiment2) eligible for use as
output columns:
RowID Experiment1 Experiment2
0000:green Act Act
0001:blue Act Not_Act
0002:red Act Act
...
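A June-formatted pair of files like the ones above might be loaded as follows. This is a sketch under assumptions rather than a reference reader: it assumes each line of the input file lists the indices of that row's nonzero attributes, that the output file starts with a header line, and that whitespace is the delimiter. The function names `read_june_factors` and `read_june_activations` are hypothetical.

```python
def read_june_factors(lines):
    """Return one list of nonzero attribute indices per input line."""
    # Assumption: every line is a row, so an empty line is an all-zero row.
    return [[int(token) for token in line.split()] for line in lines]


def read_june_activations(lines, column):
    """Return True for Act and False for Not_Act in the chosen column."""
    rows = iter(lines)
    header = next(rows).split()
    position = header.index(column)
    labels = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue
        value = fields[position]
        # Only the selected output column is validated; columns that are
        # never used as outputs may contain anything.
        if value not in ("Act", "Not_Act"):
            raise ValueError("unexpected output value %r" % value)
        labels.append(value == "Act")
    return labels


factors = read_june_factors(["1 3 7", "1 2 5 6", "0"])
labels = read_june_activations(
    [
        "RowID Experiment1 Experiment2",
        "0000:green Act Act",
        "0001:blue Act Not_Act",
        "0002:red Act Act",
    ],
    "Experiment2",
)
```

Row i of the factors lines up with row i of the labels, which is why both files are needed together.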
| Created by Paul Komarek, komarek.paul@gmail.com |