Academics | Papers | Data | Resume |
AD-trees are data structures used to accelerate conjuntive counting queries on a dataset. Suppose you had a dataset about weather in Pittsburgh, PA, USA, where each dataset record represented one day and there were 365 records. You might want to ask the question "How many rainy days in the last year also had high temperatures?" This question can be written as the conjunctive counting query
COUNT( raining=TRUE AND temperature=high )Let's pretend the answer is 57. To find this answer, we scan the entire dataset for matching records.
Suppose now that you want to know how many rainy days there were. This is the query
COUNT( raining=TRUE )We know that the answer is at least 57, since that was the answer to a more specific query. To find the correct answer, we have to scan the dataset again. If we had been more clever, we would have cached the answer to
COUNT( raining=TRUE ) while finding the answer to COUNT( raining=TRUE AND temperature=high )
.
One more thing to note, before we say anything about AD-trees.
Suppose now that we wanted to compute COUNT(
raining=TRUE AND temperature != high ) )
. We don't
really need to scan the dataset again, because of the relation
COUNT( raining=TRUE AND temperature != high ) = COUNT( raining=TRUE ) - COUNT( raining=TRUE AND temperature=high )We will call this relation the "subtraction trick".
An AD-tree is a data structure which reads through the dataset once and pre-computes the answer to every possible counting query. There are two important questions to ask:
These questions are best answered in this easy-to-read paper. In short, AD-trees are trees with a special non-binary semi-sparse structure. Some of the nodes, the AD-nodes, represent queries and store the answer to the query. Decendents of these nodes represent specializations of that query. Though our example queries use exclusively two-valued attributes, AD-trees can handle attributes having any finite number of values.
AD-trees are tractable because they are sparse. The sparsity is a result of three tricks. The last trick listed below is the most important.
We should state that last trick of the previous section more precisely. This discussion is a bit technical and terse. For a more leisurely discussion, see the same easy-to-read paper linked above.
First we simplify our query notation by replacing
each AND
with a comma. Suppose we
have a generic query containing n
attributes, which we will write as COUNT(
a_1=v_1,...,a_n=v_n )
. Suppose the k
-th attribute a_k
takes value v_k
, where v_k
is between 1 and v_k_max
. Then the
following relationships hold:
COUNT( a_1=v_1,...,a_n=v_n ) = COUNT( a_1=v_1,...,a_(k-1)=v_(k-1), a_k=v_k, a_(k+1)=v_(k+1),..., a_n=v_n ) = COUNT( a_1=v_1,...,a_(k-1)=v_(k-1),a_(k+1)=v_(k+1),...,a_n=v_n ) - COUNT( a_1=v_1,...,a_k=1,...,a_n=v_n) ... - COUNT( a_1=v_1,...,a_k=(v_k)-1,...,a_n=v_n) - COUNT( a_1=v_1,...,a_k=(v_k)+1,...,a_n=v_n) ... - COUNT( a_1=v_1,...,a_k=v_k_max,...,a_n=v_n)Pay careful attention to the difference between the
(k+1)
-st value, written v_(k+1)
, and the k
-th value plus one, written (v_k)+1
. This statement is a
generalization of the subtraction trick, above. Instead
of a single subtraction, there is one subtraction for each
possible value for a_k
except
v_k
.
Using this subtraction trick, we can choose not to store certain
counts in the AD-tree. For the generic query given above, we
choose not to create an AD-node for the value v_k
which is most common among dataset rows
satisfying the earlier part of the query, a_1=v_1,...,a_(k-1)=v_(k-1)
. This removes
not just one AD-node, but also the subtree of nodes representing
specializations of this query.
There are many papers describing applications of AD-trees, and more than a few related to the AD-tree data structure and modifications. Below, I have listed several papers from the latter class. Undoubtedly I have left out many important papers. I hope to add to and maintain this list over time, and I welcome suggestions for other papers to include.
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets .
1998 Auton page
This is, to some extent, the original AD-tree paper. It contains a nice description of what I later called Static AD-trees. This paper is short, easy-to-read, and the best place to start.
AD-trees for Fast Counting and for Fast Learning of Association Rules.
1998 Auton page
This paper describes the application of AD-trees to association rule learning. In this context, an association rule is a conjunction of attribute=value pairs, along with a special attribute=value pair called the "output". The goal of association rule learning is to find conjunctions which occur frequently in the dataset, and for which the output assignment is also true. Reading the paper will clear up this muddy description. In short, AD-trees are a terrific way to exhaustively search for general association rules.
There are many other algorithms and data structures for accelerating exhaustive or approximate rule-learning. For instance, the Apriori algorithm or Frequent Prefix trees (FP-trees). These methods often apply only to limited classes of rules. For instance, instead of allowing conjunctions arbitrary attribute=value pairs, the goal is to find frequent itemsets. Itemsets are capable of expressing inclusion of an attribute in a set of attributes. In conjunctive query terms, that means the attributes are all binary, and if an attribute is included in a query then its value must be 1. There are many interesting problems which meet these constraints, for example market basket analysis. In these cases, the generality of AD-trees is likely to slow them down or unnecessarily increase memory usage.
A Dynamic Adaptation of AD-trees for Efficient Machine Learning on Large Data Sets.
2000 My page
This paper describes modifications to regular Static AD-trees to allow automatic incorporation of dataset changes, as well as how to lazily create AD-tree node as needed to answer queries without losing efficiency. When memory is unlimited, as well as many limited-memory situation, Dynamic AD-trees are likely to outperform Static AD-trees. The principle weakness of Dynamic AD-trees is stochastically-generated queries when memory is limited. In this case, the caching structures used to maintain efficiency are mostly useless.
Besides the applications listed in this paper, we have been
able to use Dynamic AD-trees for association rule learning
on some very large datasets, including the life-sciences
dsa
dataset described in our logistic
regression papers, e.g. this
paper. Of course, dsa
is a large, sparse
binary dataset that is ideally suited to FP-trees and other
more restrictive algorithms and data-structures. Without
tuning, the Dynamic AD-tree ultimately required about 30GB
of ram. Still, we managed to exhaustively search for the
twenty-best association rules in a few hours.
AD+Tree: A Compact Adaptation of Dynamic AD-Trees for Efficient Machine Learning on Large Data Sets.
2003 [Springer page]
AD+Trees are a specialization of Dynamic AD-Trees. Some space efficiency is gained, especially for datasets with repeated rows, using a clever embedding of a kd-tree. This allows for smaller caches than used by Dynamic AD-Trees, which should improve performance when memory is limited. This link points into the Springer-Verlag site, to a book which contains this paper as a chapter. Please let me know if this link does not work for you, or if you know of a "regular" link to this paper.
I haven't really written this section. Until I do, the following links may be useful:
Academics | Papers | Data | Resume |
Up to Academics | Home (komarix.org) |
Created by Paul Komarek, komarek.paul@gmail.com |