
# Paul Komarek's Dynamic AD-tree Page

## Introduction to AD-trees

AD-trees are data structures used to accelerate conjunctive counting queries on a dataset. Suppose you had a dataset about weather in Pittsburgh, PA, USA, where each dataset record represented one day and there were 365 records. You might want to ask the question "How many rainy days in the last year also had high temperatures?" This question can be written as the conjunctive counting query

```
COUNT( raining=TRUE AND temperature=high )
```
Let's pretend the answer is 57. To find this answer, we scan the entire dataset for matching records.
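This brute-force scan can be sketched in a few lines of Python. The dataset below is randomly generated for illustration; the attribute names and the `count` helper are my own, not from any AD-tree implementation:

```python
# Toy weather dataset: one record per day, 365 rows.
# Values are invented at random purely for illustration.
import random

random.seed(0)
dataset = [
    {"raining": random.choice([True, False]),
     "temperature": random.choice(["low", "medium", "high"])}
    for _ in range(365)
]

def count(rows, **conditions):
    """Answer a conjunctive counting query by scanning every record."""
    return sum(
        all(row[attr] == value for attr, value in conditions.items())
        for row in rows
    )

n_rainy_high = count(dataset, raining=True, temperature="high")
```

Every query answered this way costs a full pass over the dataset, which is exactly the cost AD-trees are designed to avoid.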

Suppose now that you want to know how many rainy days there were. This is the query

```
COUNT( raining=TRUE )
```
We know that the answer is at least 57, since that was the answer to a more specific query. To find the correct answer, we have to scan the dataset again. If we had been more clever, we would have cached the answer to `COUNT( raining=TRUE )` while finding the answer to `COUNT( raining=TRUE AND temperature=high )`.

One more thing to note, before we say anything about AD-trees. Suppose now that we wanted to compute `COUNT( raining=TRUE AND temperature != high )`. We don't really need to scan the dataset again, because of the relation

```
COUNT( raining=TRUE AND temperature != high )
  = COUNT( raining=TRUE ) - COUNT( raining=TRUE AND temperature=high )
```
We will call this relation the "subtraction trick".
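The subtraction trick is easy to check numerically. Here is a tiny hand-built dataset (the rows are invented for the example):

```python
# Four invented rows, enough to exercise the identity.
rows = [
    {"raining": True,  "temperature": "high"},
    {"raining": True,  "temperature": "low"},
    {"raining": False, "temperature": "high"},
    {"raining": True,  "temperature": "high"},
]

def count(rows, predicate):
    return sum(1 for r in rows if predicate(r))

rainy = count(rows, lambda r: r["raining"])
rainy_high = count(rows, lambda r: r["raining"] and r["temperature"] == "high")
rainy_not_high = count(rows, lambda r: r["raining"] and r["temperature"] != "high")

# The subtraction trick: the third count needs no extra dataset scan.
assert rainy_not_high == rainy - rainy_high  # 1 == 3 - 2
```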

An AD-tree is a data structure which reads through the dataset once and pre-computes the answer to every possible counting query. There are two important questions to ask:

1. How does the AD-tree organize these counts?
2. Won't this take a lot of time and space?

These questions are best answered in this easy-to-read paper. In short, AD-trees are trees with a special non-binary semi-sparse structure. Some of the nodes, the AD-nodes, represent queries and store the answer to the query. Descendants of these nodes represent specializations of that query. Though our example queries use exclusively two-valued attributes, AD-trees can handle attributes having any finite number of values.
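A minimal sketch of this node structure might look as follows. This is my own simplification for illustration: the papers interleave AD-nodes with "Vary" nodes, which this sketch collapses into a nested dictionary, and all counts below are invented:

```python
# Simplified AD-node: stores the count for its query; children extend
# the query by fixing one more attribute=value pair.
class ADNode:
    def __init__(self, count):
        self.count = count
        # children[attr][value] -> ADNode for the query extended with attr=value
        self.children = {}

root = ADNode(count=365)            # COUNT() over the whole dataset
rainy = ADNode(count=120)           # COUNT( raining=TRUE ), invented count
root.children["raining"] = {True: rainy}
# COUNT( raining=TRUE AND temperature=high ), the 57 from the example
rainy.children["temperature"] = {"high": ADNode(count=57)}

# Answering a query is a walk down the tree, not a dataset scan.
assert root.children["raining"][True].children["temperature"]["high"].count == 57
```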

AD-trees are tractable because they are sparse. The sparsity is a result of three tricks. The last trick listed below is the most important.

1. The answer to many counting queries, and hence specializations of those queries, is zero.
2. When the number of dataset rows satisfying a query becomes sufficiently small, we store the list of satisfying rows in the AD-node and make the node a leaf. The count for more specialized queries is computed on-demand by scanning these rows.
3. We can compute the answers to some questions by using the answer to other already-answered questions. This implies that we can quickly answer any counting query, without storing the answer to every counting query.
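The second trick, storing row lists in leaves, can be sketched as follows. The class and threshold are my own illustrative names, not from the papers:

```python
# Leaf-list trick: when a query matches few rows, store the matching row
# indices in the AD-node and answer any specialization by scanning only
# those rows, never the whole dataset.
LEAF_THRESHOLD = 16  # illustrative cutoff

class LeafNode:
    def __init__(self, dataset, row_indices):
        self.dataset = dataset
        self.row_indices = row_indices  # rows satisfying this node's query
        self.count = len(row_indices)

    def specialized_count(self, **extra_conditions):
        """Count a more specialized query by scanning the stored rows only."""
        return sum(
            all(self.dataset[i][a] == v for a, v in extra_conditions.items())
            for i in self.row_indices
        )

dataset = [
    {"raining": True, "temperature": t}
    for t in ["high", "high", "low", "medium"]
]
leaf = LeafNode(dataset, [0, 1, 2, 3])  # pretend query: raining=TRUE
assert leaf.specialized_count(temperature="high") == 2
```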

## The Subtraction Trick, Generalized

We should state that last trick of the previous section more precisely. This discussion is a bit technical and terse. For a more leisurely discussion, see the same easy-to-read paper linked above.

First we simplify our query notation by replacing each `AND` with a comma. Suppose we have a generic query containing `n` attributes, which we will write as `COUNT( a_1=v_1,...,a_n=v_n )`. Suppose the `k`-th attribute `a_k` takes value `v_k`, where `v_k` is between 1 and `v_k_max`. Then the following relationships hold:

```
COUNT( a_1=v_1,...,a_n=v_n )
  = COUNT( a_1=v_1,...,a_(k-1)=v_(k-1), a_k=v_k, a_(k+1)=v_(k+1),..., a_n=v_n )

  = COUNT( a_1=v_1,...,a_(k-1)=v_(k-1),a_(k+1)=v_(k+1),...,a_n=v_n )
  - COUNT( a_1=v_1,...,a_k=1,...,a_n=v_n )
    ...
  - COUNT( a_1=v_1,...,a_k=(v_k)-1,...,a_n=v_n )
  - COUNT( a_1=v_1,...,a_k=(v_k)+1,...,a_n=v_n )
    ...
  - COUNT( a_1=v_1,...,a_k=v_k_max,...,a_n=v_n )
```
Pay careful attention to the difference between the `(k+1)`-st value, written `v_(k+1)`, and the `k`-th value plus one, written `(v_k)+1`. This statement is a generalization of the subtraction trick, above. Instead of a single subtraction, there is one subtraction for each possible value for `a_k` except `v_k`.

Using this subtraction trick, we can choose not to store certain counts in the AD-tree. For the generic query given above, we choose not to create an AD-node for the value `v_k` which is most common among dataset rows satisfying the earlier part of the query, `a_1=v_1,...,a_(k-1)=v_(k-1)`. This removes not just one AD-node, but also the subtree of nodes representing specializations of this query.
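The elision of the most-common-value subtree can be checked concretely for a three-valued attribute. The dataset and helper below are invented for the example:

```python
# Generalized subtraction for a three-valued attribute "temperature".
rows = [
    {"raining": True,  "temperature": "high"},
    {"raining": True,  "temperature": "high"},
    {"raining": True,  "temperature": "low"},
    {"raining": True,  "temperature": "medium"},
    {"raining": False, "temperature": "high"},
]

def count(rows, **conds):
    return sum(all(r[a] == v for a, v in conds.items()) for r in rows)

values = ["low", "medium", "high"]
# Suppose "high" is the most common temperature among rows with
# raining=TRUE, so the AD-tree stores no node for it.  Its count is
# still recoverable from the parent count and the sibling counts:
recovered = count(rows, raining=True) - sum(
    count(rows, raining=True, temperature=v) for v in values if v != "high"
)
assert recovered == count(rows, raining=True, temperature="high")  # both are 2
```

Because the elided value is the most common one, the largest subtree is the one never built, which is where most of the space savings come from.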

## AD-tree Papers

There are many papers describing applications of AD-trees, and more than a few related to the AD-tree data structure and modifications. Below, I have listed several papers from the latter class. Undoubtedly I have left out many important papers. I hope to add to and maintain this list over time, and I welcome suggestions for other papers to include.

• Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets.

This is, to some extent, the original AD-tree paper. It contains a nice description of what I later called Static AD-trees. This paper is short, easy-to-read, and the best place to start.

• AD-trees for Fast Counting and for Fast Learning of Association Rules.

1998 Auton page

This paper describes the application of AD-trees to association rule learning. In this context, an association rule is a conjunction of attribute=value pairs, along with a special attribute=value pair called the "output". The goal of association rule learning is to find conjunctions which occur frequently in the dataset, and for which the output assignment is also true. Reading the paper will clear up this muddy description. In short, AD-trees are a terrific way to exhaustively search for general association rules.

There are many other algorithms and data structures for accelerating exhaustive or approximate rule-learning, for instance the Apriori algorithm or Frequent Prefix trees (FP-trees). These methods often apply only to limited classes of rules. For instance, instead of allowing conjunctions of arbitrary attribute=value pairs, the goal is to find frequent itemsets. Itemsets can only express the inclusion of an attribute in a set of attributes. In conjunctive query terms, that means the attributes are all binary, and if an attribute is included in a query then its value must be 1. There are many interesting problems which meet these constraints, for example market basket analysis. In these cases, the generality of AD-trees is likely to slow them down or unnecessarily increase memory usage.

• A Dynamic Adaptation of AD-trees for Efficient Machine Learning on Large Data Sets.

2000 My page

This paper describes modifications to regular Static AD-trees to allow automatic incorporation of dataset changes, as well as how to lazily create AD-tree nodes as needed to answer queries without losing efficiency. When memory is unlimited, as well as in many limited-memory situations, Dynamic AD-trees are likely to outperform Static AD-trees. The principal weakness of Dynamic AD-trees is stochastically-generated queries when memory is limited. In this case, the caching structures used to maintain efficiency are mostly useless.

Besides the applications listed in this paper, we have been able to use Dynamic AD-trees for association rule learning on some very large datasets, including the life-sciences `dsa` dataset described in our logistic regression papers, e.g. this paper. Of course, `dsa` is a large, sparse binary dataset that is ideally suited to FP-trees and other more restrictive algorithms and data structures. Without tuning, the Dynamic AD-tree ultimately required about 30 GB of RAM. Still, we managed to exhaustively search for the twenty best association rules in a few hours.

• AD+Tree: A Compact Adaptation of Dynamic AD-Trees for Efficient Machine Learning on Large Data Sets.

2003 [Springer page]

AD+Trees are a specialization of Dynamic AD-Trees. Some space efficiency is gained, especially for datasets with repeated rows, using a clever embedding of a kd-tree. This allows for smaller caches than used by Dynamic AD-Trees, which should improve performance when memory is limited. This link points into the Springer-Verlag site, to a book which contains this paper as a chapter. Please let me know if this link does not work for you, or if you know of a "regular" link to this paper.

## AD-tree Software

I haven't really written this section. Until I do, the following links may be useful:

• Some AD-tree software can be found at the Auton lab's software pages.
• For my Dynamic AD-tree software, or my python bindings to Dynamic AD-trees and python rule-learning software, please email me.
• If you have AD-tree and related software and want it included here, please email me.

 Created by Paul Komarek, komarek.paul@gmail.com