Convert list to set complexity
Earlier [Sets as Collective Data] we introduced sets. Recall that the elements of a set have no specific order, and ignore duplicates.If these ideas are not familiar, please read Sets as Collective Data, since they will be important when discussing the representation of sets. At that time we relied on Pyrets built-in representation of sets. Now we will discuss how to build sets for ourselves. In what follows, we will focus only on sets of numbers. We will start by discussing how to represent sets using lists. Intuitively, using lists to represent sets of data seems problematic, because lists respect both order and duplication. For instance, check: [list: 1, 2, 3] is [list: 3, 2, 1, 1] end In principle, we want sets to obey the following interface:
We may also find it also useful to have functions such as insert-many :: (List, Set -> Set) which, combined with mt-set, easily gives us a to-set function. Sets can contain many kinds of values, but not necessarily any kind: we need to be able to check for two values being equal (which is a requirement for a set, but not for a list!), which cant be done with all values [REF]; and sometimes we might even want the elements to obey an ordering [Converting Values to Ordered Values]. Numbers satisfy both characteristics. 15.1Representing Sets by ListsIn what follows we will see multiple different representations of sets, so we will want names to tell them apart. Well use LSet to stand for sets represented as lists. As a starting point, lets consider the implementation of sets using lists as the underlying representation. After all, a set appears to merely be a list wherein we ignore the order of elements. 15.1.1Representation ChoicesThe empty list can stand in for the empty set type LSet = List mt-set = empty and we can presumably define size as fun size(s :: LSet) -> Number: s.length() end However, this reduction (of sets to lists) can be dangerous:
To avoid these perils, we have to be precise about how were going to use lists to represent sets. One key question (but not the only one, as well soon see [REF]) is what to do about duplicates. One possibility is for insert to check whether an element is already in the set and, if so, leave the representation unchanged; this incurs a cost during insertion but avoids unnecessary duplication and lets us use length to implement size. The other option is to define insert as linkliterally, and have some other procedure perform the filtering of duplicates. 15.1.2Time ComplexityWhat is the complexity of this representation of sets? Lets consider just insert, check, and size. Suppose the size of the set is \(k\) (where, to avoid ambiguity, we let \(k\) represent the number of distinct elements). The complexity of these operations depends on whether or not we store duplicates:
One implementation of size is fun size(s :: LSet) -> Number: cases (List) s: | empty => 0 | link(f, r) => if r.member(f): size(r) else: 1 + size(r) end end end Lets now compute the complexity of the body of the function, assuming the number of distinct elements in s is \(k\) but the actual number of elements in s is \(d\), where \(d \geq k\). To compute the time to run size on \(d\) elements, \(T(d)\), we should determine the number of operations in each question and answer. The first question has a constant number of operations, and the first answer also a constant. The second question also has a constant number of operations. Its answer is a conditional, whose first question (r.member(f) needs to traverse the entire list, and hence has \(O([k -> d])\) operations. If it succeeds, we recur on something of size \(T(d-1)\); else we do the same but perform a constant more operations. Thus \(T(0)\) is a constant, while the recurrence (in big-Oh terms) is \begin{equation*}T(d) = d + T(d-1)\end{equation*} Thus \(T \in O([d \rightarrow d^2])\). Note that this is quadratic in the number of elements in the list, which may be much bigger than the size of the set. 15.1.3Choosing Between RepresentationsNow that we have two representations with different complexities, its worth thinking about how to choose between them. To do so, lets build up the following table. The table distinguishes between the interface (the set) and the implementation (the list), becauseowing to duplicates in the representationthese two may not be the same. In the table well consider just two of the most common operations, insertion and membership checking:
A naive reading of this would suggest that the representation with duplicates is better because its sometimes constant and sometimes linear, whereas the version without duplicates is always linear. However, this masks a very important distinction: what the linear means. When there are no duplicates, the size of the list is the same as the size of the set. However, with duplicates, the size of the list can be arbitrarily larger than that of the set! Based on this, we can draw several lessons:
Moreover, there is no reason a program should use only one representation. It could well begin with one representation, then switch to another as it better understands its workload. The only thing it would need to do to switch is to convert all existing data between the representations. How might this play out above? Observe that data conversion is very cheap in one direction: since every list without duplicates is automatically also a list with (potential) duplicates, converting in that direction is trivial (the representation stays unchanged, only its interpretation changes). The other direction is harder: we have to filter duplicates (which takes time quadratic in the number of elements in the list). Thus, a program can make an initial guess about its workload and pick a representation accordingly, but maintain statistics as it runs and, when it finds its assumption is wrong, switch representationsand can do so as many times as needed. 15.1.4Other Operations
With lists the first element is well-defined, whereas sets are defined to have no ordering. Indeed, just to make sure users of your sets dont accidentally assume anything about your implementation (e.g., if you implement one using first, they may notice that one always returns the element most recently added to the list), you really ought to return a random element of the set on each invocation. Unfortunately, returning a random element means the above interface is unusable. Suppose s is bound to a set containing 1, 2, and 3. Say the first time one(s) is invoked it returns 2, and the second time 1. (This already means one is not a functionan issue well get to elsewhere [REF].) The third time it may again return 2. Thus others has to remember which element was returned the last time one was called, and return the set sans that element. Suppose we now invoke one on the result of calling others. That means we might have a situation where one(s) produces the same result as one(others(s)).
15.2Making Sets Grow on TreesLets start by noting that it seems better, if at all possible, to avoid storing duplicates. Duplicates are only problematic during insertion due to the need for a membership test. But if we can make membership testing cheap, then we would be better off using it to check for duplicates and storing only one instance of each value (which also saves us space). Thus, lets try to improve the time complexity of membership testing (and, hopefully, of other operations too). It seems clear that with a (duplicate-free) list representation of a set, we cannot really beat linear time for membership checking. This is because at each step, we can eliminate only one element from contention which in the worst case requires a linear amount of work to examine the whole set. Instead, we need to eliminate many more elements with each comparisonmore than just a constant. In our handy set of recurrences (Solving Recurrences), one stands out: \(T(k) = T(k/2) + c\). It says that if, with a constant amount of work we can eliminate half the input, we can perform membership checking in logarithmic time. This will be our goal. Before we proceed, its worth putting logarithmic growth in perspective. Asymptotically, logarithmic is obviously not as nice as constant. However, logarithmic growth is very pleasant because it grows so slowly. For instance, if an input doubles from size \(k\) to \(2k\), its logarithmand hence resource usagegrows only by \(\log 2k - \log k = \log 2\), which is a constant. Indeed, for just about all problems, practically speaking the logarithm of the input size is bounded by a constant (that isnt even very large). Therefore, in practice, for many programs, if we can shrink our resource consumption to logarithmic growth, its probably time to move on and focus on improving some other part of the system. 15.2.1Converting Values to Ordered ValuesWe have actually just made an extremely subtle assumption. When we check one element for membership and eliminate it, we have eliminated only one element. To eliminate more than one element, we need one element to speak for several. That is, eliminating that one value needs to have safely eliminated several others as well without their having to be consulted. In particular, then, we can no longer compare for mere equality, which compares one set element against another element; we need a comparison that compares against an element against a set of elements. To do this, we have to convert an arbitrary datum into a datatype that permits such comparison. This is known as hashing. A hash function consumes an arbitrary value and produces a comparable representation of it (its hash)most commonly (but not strictly necessarily), a number. A hash function must naturally be deterministic: a fixed value should always yield the same hash (otherwise, we might conclude that an element in the set is not actually in it, etc.). Particular uses may need additional properties: e.g., below we assume its output is partially ordered. Let us now consider how one can compute hashes. If the input datatype is a number, it can serve as its own hash. Comparison simply uses numeric comparison (e.g., <). Then, transitivity of < ensures that if an element \(A\) is less than another element \(B\), then \(A\) is also less than all the other elements bigger than \(B\). The same principle applies if the datatype is a string, using string inequality comparison. But what if we are handed more complex datatypes? Before we answer that, consider that in practice numbers are more efficient to compare than strings (since comparing two numbers is very nearly constant time). Thus, although we could use strings directly, it may be convenient to find a numeric representation of strings. In both cases, we will convert each character of the string into a numbere.g., by considering its ASCII encoding. Based on that, here are two hash functions:
The first representation is invertible, using the Fundamental Theorem of Arithmetic: given the resulting number, we can reconstruct the input unambiguously (i.e., 16200000 can only map to the input above, and none other). The second encoding is, of course, not invertible (e.g., simply permute the characters and, by commutativity, the sum will be the same). Now let us consider more general datatypes. The principle of hashing will be similar. If we have a datatype with several variants, we can use a numeric tag to represent the variants: e.g., the primes will give us invertible tags. For each field of a record, we need an ordering of the fields (e.g., lexicographic, or alphabetical order), and must hash their contents recursively; having done so, we get in effect a string of numbers, which we have shown how to handle. Now that we have understood how one can deterministically convert any arbitrary datum into a number, in what follows, we will assume that the trees representing sets are trees of numbers. However, it is worth considering what we really need out of a hash. In Set Membership by Hashing Redux, we will not need partial ordering. Invertibility is more tricky. In what follows below, we have assumed that finding a hash is tantamount to finding the set element itself, which is not true if multiple values can have the same hash. In that case, the easiest thing to do is to store alongside the hash all the values that hashed to it, and we must search through all of these values to find our desired element. Unfortunately, this does mean that in an especially perverse situation, the desired logarithmic complexity will actually be linear complexity after all! In real systems, hashes of values are typically computed by the programming language implementation. This has the virtue that they can often be made unique. How does the system achieve this? Easy: it essentially uses the memory address of a value as its hash. (Well, not so fast! Sometimes the memory system can and does move values around ((part "garbage-collection")). In these cases computing a hash value is more complicated.) 15.2.2Using Binary TreesBecause logs come from trees. Clearly, a list representation does not let us eliminate half the elements with a constant amount of work; instead, we need a tree. Thus we define a binary tree of (for simplicity) numbers: data BT: | leaf | node(v :: Number, l :: BT, r :: BT) end Given this definition, lets define the membership checker: fun is-in-bt(e :: Number, s :: BT) -> Boolean: cases (BT) s: | leaf => false | node(v, l, r) => if e == v: true else: is-in-bt(e, l) or is-in-bt(e, r) end end end Oh, wait. If the element were looking for isnt the root, what do we do? It could be in the left child or it could be in the right; we wont know for sure until weve examined both. Thus, we cant throw away half the elements; the only one we can dispose of is the value at the root. Furthermore, this property holds at every level of the tree. Thus, membership checking needs to examine the entire tree, and we still have complexity linear in the size of the set. How can we improve on this? The comparison needs to help us eliminate not only the root but also one whole sub-tree. We can only do this if the comparison speaks for an entire sub-tree. It can do so if all elements in one sub-tree are less than or equal to the root value, and all elements in the other sub-tree are greater than or equal to it. Of course, we have to be consistent about which side contains which subset; it is conventional to put the smaller elements to the left and the bigger ones to the right. This refines our binary tree definition to give us a binary search tree (BST).
Its not. To actually throw away half the tree, we need to be sure that everything in the left sub-tree is less than the value in the root and similarly, everything in the right sub-tree is greater than the root.We have used <= instead of < above because even though we dont want to permit duplicates when representing sets, in other cases we might not want to be so stringent; this way we can reuse the above implementation for other purposes. But the definition above performs only a shallow comparison. Thus we could have a root a with a right child, b, such that b > a; and the b node could have a left child c such that c < b; but this does not guarantee that c > a. In fact, it is easy to construct a counter-example that passes this check: check: node(5, node(3, leaf, node(6, leaf, leaf)), leaf) satisfies is-a-bst-buggy # FALSE! end With a corrected definition, we can now define a refined version of binary trees that are search trees: We can also remind ourselves that the purpose of this exercise was to define sets, and define TSets to be tree sets: type TSet = BST mt-set = leaf Now lets implement our operations on the BST representation. First well write a template: fun is-in(e :: Number, s :: BST) -> Bool: cases (BST) s: | leaf => ... | node(v, l :: BST, r :: BST) => ... ... is-in(l) ... ... is-in(r) ... end end Observe that the data definition of a BST gives us rich information about the two children: they are each a BST, so we know their elements obey the ordering property. We can use this to define the actual operations: fun is-in(e :: Number, s :: BST) -> Boolean: cases (BST) s: | leaf => false | node(v, l, r) => if e == v: true else if e < v: is-in(e, l) else if e > v: is-in(e, r) end end end fun insert(e :: Number, s :: BST) -> BST: cases (BST) s: | leaf => node(e, leaf, leaf) | node(v, l, r) => if e == v: s else if e < v: node(v, insert(e, l), r) else if e > v: node(v, l, insert(e, r)) end end end In both functions we are strictly assuming the invariant of the BST, and in the latter case also ensuring it. Make sure you identify where, why, and how. You should now be able to define the remaining operations. Of these, size clearly requires linear time (since it has to count all the elements), but because is-in and insert both throw away one of two children each time they recur, they take logarithmic time.
But wait a minute. Are we actually done? Our recurrence takes the form \(T(k) = T(k/2) + c\), but what in our data definition guaranteed that the size of the child traversed by is-in will be half the size?
Imagine starting with the empty tree and inserting the values 1, 2, 3, and 4, in order. The resulting tree would be check: insert(4, insert(3, insert(2, insert(1, mt-set)))) is node(1, leaf, node(2, leaf, node(3, leaf, node(4, leaf, leaf)))) end Searching for 4 in this tree would have to examine all the set elements in the tree. In other words, this binary search tree is degenerateit is effectively a list, and we are back to having the same complexity we had earlier. Therefore, using a binary tree, and even a BST, does not guarantee the complexity we want: it does only if our inputs have arrived in just the right order. However, we cannot assume any input ordering; instead, we would like an implementation that works in all cases. Thus, we must find a way to ensure that the tree is always balanced, so each recursive call in is-in really does throw away half the elements. 15.2.3A Fine Balance: Tree SurgeryLets define a balanced binary search tree (BBST). It must obviously be a search tree, so lets focus on the balanced part. We have to be careful about precisely what this means: we cant simply expect both sides to be of equal size because this demands that the tree (and hence the set) have an even number of elements and, even more stringently, to have a size that is a power of two.
Therefore, we relax the notion of balance to one that is both accommodating and sufficient. We use the term balance factor for a node to refer to the height of its left child minus the height of its right child (where the height is the depth, in edges, of the deepest node). We allow every node of a BBST to have a balance factor of \(-1\), \(0\), or \(1\) (but nothing else): that is, either both have the same height, or the left or the right can be one taller. Note that this is a recursive property, but it applies at all levels, so the imbalance cannot accumulate making the whole tree arbitrarily imbalanced.
Here is an obvious but useful observation: every BBST is also a BST (this was true by the very definition of a BBST). Why does this matter? It means that a function that operates on a BST can just as well be applied to a BBST without any loss of correctness. So far, so easy. All that leaves is a means of creating a BBST, because its responsible for ensuring balance. Its easy to see that the constant empty-set is a BBST value. So that leaves only insert. Here is our situation with insert. Assuming we start with a BBST, we can determine in logarithmic time whether the element is already in the tree and, if so, ignore it.To implement a bag we count how many of each element are in it, which does not affect the trees height. When inserting an element, given balanced trees, the insert for a BST takes only a logarithmic amount of time to perform the insertion. Thus, if performing the insertion does not affect the trees balance, were done. Therefore, we only need to consider cases where performing the insertion throws off the balance. Observe that because \(<\) and \(>\) are symmetric (likewise with \(<=\) and \(>=\)), we can consider insertions into one half of the tree and a symmetric argument handles insertions into the other half. Thus, suppose we have a tree that is currently balanced into which we are inserting the element \(e\). Lets say \(e\) is going into the left sub-tree and, by virtue of being inserted, will cause the entire tree to become imbalanced.Some trees, like family trees [REF], represent real-world data. It makes no sense to balance a family tree: it must accurately model whatever reality it represents. These set-representing trees, in contrast, are chosen by us, not dictated by some external reality, so we are free to rearrange them. There are two ways to proceed. One is to consider all the places where we might insert \(e\) in a way that causes an imbalance and determine what to do in each case.
The number of cases is actually quite overwhelming (if you didnt think so, you missed a few...). Therefore, we instead attack the problem after it has occurred: allow the existing BST insert to insert the element, assume that we have an imbalanced tree, and show how to restore its balance.The insight that a tree can be made self-balancing is quite remarkable, and there are now many solutions to this problem. This particular one, one of the oldest, is due to G.M. Adelson-Velskii and E.M. Landis. In honor of their initials it is called an AVL Tree, though the tree itself is quite evident; their genius is in defining re-balancing. Thus, in what follows, we begin with a tree that is balanced; insert causes it to become imbalanced; we have assumed that the insertion happened in the left sub-tree. In particular, suppose a (sub-)tree has a balance factor of \(2\) (positive because were assuming the left is imbalanced by insertion). The procedure for restoring balance depends critically on the following property:
The algorithm that follows is applied as insert returns from its recursion, i.e., on the path from the inserted value back to the root. Since this path is of logarithmic length in the sets size (due to the balancing property), and (as we shall see) performs only a constant amount of work at each step, it ensures that insertion also takes only logarithmic time, thus completing our challenge. To visualize the algorithm, lets use this tree schematic: Here, \(p\) is the value of the element at the root (though we will also abuse terminology and use the value at a root to refer to that whole tree), \(q\) is the value at the root of the left sub-tree (so \(q < p\)), and \(A\), \(B\), and \(C\) name the respective sub-trees. We have assumed that \(e\) is being inserted into the left sub-tree, which means \(e < p\). Lets say that \(C\) is of height \(k\). Before insertion, the tree rooted at \(q\) must have had height \(k+1\) (or else one insertion cannot create imbalance). In turn, this means \(A\) must have had height \(k\) or \(k-1\), and likewise for \(B\). Suppose that after insertion, the tree rooted at \(q\) has height \(k+2\). Thus, either \(A\) or \(B\) has height \(k+1\) and the other must have height less than that (either \(k\) or \(k-1\)).
This gives us two cases to consider. 15.2.3.1Left-Left CaseLets say the imbalance is in \(A\), i.e., it has height \(k+1\). Lets expand that tree: We know the following about the data in the sub-trees. Well use the notation \(T < a\) where \(T\) is a tree and \(a\) is a single value to mean every value in \(T\) is less than \(a\).
Lets also remind ourselves of the sizes:
Imagine this tree is a mobile, which has gotten a little skewed to the left. You would naturally think to suspend the mobile a little further to the left to bring it back into balance. That is effectively what we will do: Observe that this preserves each of the ordering properties above. In addition, the \(A\) subtree has been brought one level closer to the root than earlier relative to \(B\) and \(C\). This restores the balance (as you can see if you work out the heights of each of \(A_i\), \(B\), and \(C\)). Thus, we have also restored balance. 15.2.3.2Left-Right CaseThe imbalance might instead be in \(B\). Expanding: Again, lets record what we know about data order:
We therefore have to somehow bring \(B_1\) and \(B_2\) one level closer to the root of the tree. By using the above data ordering knowledge, we can construct this tree: Of course, if \(B_1\) is the problematic sub-tree, this still does not address the problem. However, we are now back to the previous (left-left) case; rotating gets us to: Now observe that we have precisely maintained the data ordering constraints. Furthermore, from the root, \(A\)s lowest node is at height \(k+1\) or \(k+2\); so is \(B_1\)s; so is \(B_2\)s; and \(C\)s is at \(k+2\). 15.2.3.3Any Other Cases?Were we a little too glib before? In the left-right case we said that only one of \(B_1\) or \(B_2\) could be of height \(k\) (after insertion); the other had to be of height \(k-1\). Actually, all we can say for sure is that the other has to be at most height \(k-2\).
|