Academics Papers Data Resume

Paul Komarek's Dynamic AD-tree Page

Introduction to AD-trees

AD-trees are data structures used to accelerate conjuntive counting queries on a dataset. Suppose you had a dataset about weather in Pittsburgh, PA, USA, where each dataset record represented one day and there were 365 records. You might want to ask the question "How many rainy days in the last year also had high temperatures?" This question can be written as the conjunctive counting query

        COUNT( raining=TRUE AND temperature=high )
      
Let's pretend the answer is 57. To find this answer, we scan the entire dataset for matching records.

Suppose now that you want to know how many rainy days there were. This is the query

        COUNT( raining=TRUE )
      
We know that the answer is at least 57, since that was the answer to a more specific query. To find the correct answer, we have to scan the dataset again. If we had been more clever, we would have cached the answer to COUNT( raining=TRUE ) while finding the answer to COUNT( raining=TRUE AND temperature=high ).

One more thing to note, before we say anything about AD-trees. Suppose now that we wanted to compute COUNT( raining=TRUE AND temperature != high ) ). We don't really need to scan the dataset again, because of the relation

        COUNT( raining=TRUE AND temperature != high )
          = COUNT( raining=TRUE ) - COUNT( raining=TRUE AND temperature=high )
      
We will call this relation the "subtraction trick".

An AD-tree is a data structure which reads through the dataset once and pre-computes the answer to every possible counting query. There are two important questions to ask:

  1. How does the AD-tree organize these counts?
  2. Won't this take a lot of time and space?

These questions are best answered in this easy-to-read paper. In short, AD-trees are trees with a special non-binary semi-sparse structure. Some of the nodes, the AD-nodes, represent queries and store the answer to the query. Decendents of these nodes represent specializations of that query. Though our example queries use exclusively two-valued attributes, AD-trees can handle attributes having any finite number of values.

AD-trees are tractable because they are sparse. The sparsity is a result of three tricks. The last trick listed below is the most important.

  1. The answer to many counting queries, and hence specializations of those queries, is zero.
  2. When the number of dataset rows satisfying a query becomes sufficiently small, we store the list of satisfying rows in the AD-node and make the node a leaf. The count for more specialized queries is computed on-demand by scanning these rows.
  3. We can compute the answers to some questions by using the answer to other already-answered questions. This implies that we can quickly answer any counting query, without storing the answer to every counting query.

The Subtraction Trick, Generalized

We should state that last trick of the previous section more precisely. This discussion is a bit technical and terse. For a more leisurely discussion, see the same easy-to-read paper linked above.

First we simplify our query notation by replacing each AND with a comma. Suppose we have a generic query containing n attributes, which we will write as COUNT( a_1=v_1,...,a_n=v_n ). Suppose the k-th attribute a_k takes value v_k, where v_k is between 1 and v_k_max. Then the following relationships hold:

        COUNT( a_1=v_1,...,a_n=v_n )
          = COUNT( a_1=v_1,...,a_(k-1)=v_(k-1), a_k=v_k, a_(k+1)=v_(k+1),..., a_n=v_n  )

          = COUNT( a_1=v_1,...,a_(k-1)=v_(k-1),a_(k+1)=v_(k+1),...,a_n=v_n )
            - COUNT( a_1=v_1,...,a_k=1,...,a_n=v_n)
              ...
            - COUNT( a_1=v_1,...,a_k=(v_k)-1,...,a_n=v_n)
            - COUNT( a_1=v_1,...,a_k=(v_k)+1,...,a_n=v_n)
              ...
            - COUNT( a_1=v_1,...,a_k=v_k_max,...,a_n=v_n)
      
Pay careful attention to the difference between the (k+1)-st value, written v_(k+1), and the k-th value plus one, written (v_k)+1. This statement is a generalization of the subtraction trick, above. Instead of a single subtraction, there is one subtraction for each possible value for a_k except v_k.

Using this subtraction trick, we can choose not to store certain counts in the AD-tree. For the generic query given above, we choose not to create an AD-node for the value v_k which is most common among dataset rows satisfying the earlier part of the query, a_1=v_1,...,a_(k-1)=v_(k-1). This removes not just one AD-node, but also the subtree of nodes representing specializations of this query.

AD-tree Papers

There are many papers describing applications of AD-trees, and more than a few related to the AD-tree data structure and modifications. Below, I have listed several papers from the latter class. Undoubtedly I have left out many important papers. I hope to add to and maintain this list over time, and I welcome suggestions for other papers to include.

AD-tree Software

I haven't really written this section. Until I do, the following links may be useful:

Academics Papers Data Resume

Up to Academics Home (komarix.org)
Created by Paul Komarek, komarek.paul@gmail.com