Earlier (Sets as Collective Data) we introduced sets. Recall that the elements of a set have no specific order, and that sets ignore duplicates. If these ideas are not familiar, please read Sets as Collective Data, since they will be important when discussing the representation of sets. At that time we relied on Pyret's built-in representation of sets. Now we will discuss how to build sets for ourselves. In what follows, we will focus only on sets of numbers.
We will start by discussing how to represent sets using lists. Intuitively, using lists to represent sets of data seems problematic, because lists respect both order and duplication. For instance,
check:
  [list: 1, 2, 3] is [list: 3, 2, 1, 1]
end
does not pass, even though both lists contain exactly the same elements when viewed as sets.
In principle, we want sets to obey the following interface:
mt-set :: Set
is-in :: (T, Set -> Bool)
insert :: (T, Set -> Set)
union :: (Set, Set -> Set)
size :: (Set -> Number)
to-list :: (Set -> List)
We may also find it useful to have functions such as
insert-many :: (List, Set -> Set)
which, combined with mt-set, easily gives us a to-set function.
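As a minimal sketch of that claim, suppose (purely for illustration) that a list stands in for the set and that insert skips elements already present. Under those assumptions, insert-many is a walk over the input list and to-set is a one-liner; the bodies below are one possible choice, not the only one.

```pyret
# Assumption for this sketch: a list stands in for the set,
# and insert leaves the set unchanged if the element is present.
type LSet = List
mt-set = empty

fun insert(e :: Number, s :: LSet) -> LSet:
  if s.member(e): s else: link(e, s) end
end

# Insert every element of elts into s, one at a time.
fun insert-many(elts :: List, s :: LSet) -> LSet:
  cases (List) elts:
    | empty => s
    | link(f, r) => insert-many(r, insert(f, s))
  end
end

# to-set starts from the empty set and inserts everything.
fun to-set(elts :: List) -> LSet:
  insert-many(elts, mt-set)
end

check:
  to-set([list: 1, 2, 1]) is [list: 2, 1]
  to-set(empty) is mt-set
end
```

Note that the test pins down a particular element order only because the underlying representation here is a list; a client of the set interface should not rely on it.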
Sets can contain many kinds of values, but not necessarily any kind: we need to be able to check whether two values are equal (which is a requirement for a set, but not for a list!), which can't be done with all values [REF]; and sometimes we might even want the elements to obey an ordering (Converting Values to Ordered Values). Numbers satisfy both characteristics.
15.1 Representing Sets by Lists
In what follows we will see multiple different representations of sets, so we will want names to tell them apart. We'll use LSet to stand for sets represented as lists.
As a starting point, let's consider the implementation of sets using lists as the underlying representation. After all, a set appears to merely be a list wherein we ignore the order of elements.
15.1.1 Representation Choices
The empty list can stand in for the empty set
type LSet = List
mt-set = empty
and we can presumably define size as
fun size(s :: LSet) -> Number:
  s.length()
end
There is a subtle difference between lists and sets. The list
[list: 1, 1]
is not the same as the list
[list: 1]
because the first list has length two whereas the second has length one. Treated as a set, however, the two are the same: they both have size one. Thus, our implementation of size above is incorrect if we don't take into account duplicates (either during insertion or while computing the size).
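The mismatch can be observed directly. The sketch below uses a hypothetical name, naive-size, for the length-based definition so that it does not clash with any other definition of size.

```pyret
type LSet = List

# Hypothetical name for the length-based size discussed above.
fun naive-size(s :: LSet) -> Number:
  s.length()
end

check "length over-counts when duplicates are stored":
  # As lists these differ in length ...
  naive-size([list: 1, 1]) is 2
  naive-size([list: 1]) is 1
  # ... although as sets both should have size one.
end
```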
The list representation carries two further perils:
We might falsely make assumptions about the order in which elements are retrieved from the set due to the ordering guarantee provided by the underlying list representation. This might hide bugs that we don't discover until we change the representation.
We might have chosen a set representation because we didn't need to care about order, and expected lots of duplicate items. A list representation might store all the duplicates, resulting in significantly more memory use (and slower programs) than we expected.
To avoid these perils, we have to be precise about how we're going to use lists to represent sets. One key question (but not the only one, as we'll soon see [REF]) is what to do about duplicates. One possibility is for insert to check whether an element is already in the set and, if so, leave the representation unchanged; this incurs a cost during insertion but avoids unnecessary duplication and lets us use length to implement size. The other option is to define insert as link, literally,
insert = link
and have some other procedure perform the filtering of duplicates.
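The two options can be placed side by side. In this sketch the names insert-nodup and insert-dup are hypothetical, chosen only to tell the two variants apart; is-in follows the interface given earlier.

```pyret
type LSet = List

fun is-in(e :: Number, s :: LSet) -> Boolean:
  s.member(e)
end

# Option 1: check membership first, so the list never holds duplicates.
fun insert-nodup(e :: Number, s :: LSet) -> LSet:
  if is-in(e, s): s else: link(e, s) end
end

# Option 2: always link; duplicates must be filtered somewhere else.
fun insert-dup(e :: Number, s :: LSet) -> LSet:
  link(e, s)
end

check:
  insert-nodup(1, [list: 1, 2]) is [list: 1, 2]
  insert-nodup(3, [list: 1, 2]) is [list: 3, 1, 2]
  insert-dup(1, [list: 1, 2]) is [list: 1, 1, 2]
end
```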
15.1.2 Time Complexity
What is the complexity of this representation of sets? Let's consider just insert, check (the membership test, called is-in in the interface above), and size. Suppose the size of the set is \(k\) (where, to avoid ambiguity, we let \(k\) represent the number of distinct elements). The complexity of these operations depends on whether or not we store duplicates:
If we don't store duplicates, then size is simply length, which takes time linear in \(k\). Similarly, check only needs to traverse the list once to determine whether or not an element is present, which also takes time linear in \(k\). But insert needs to check whether an element is already present, which takes time linear in \(k\), followed by at most a constant-time operation (link).
If we do store duplicates, then insert is constant time: it simply links on the new element without regard to whether it already is in the set representation. check traverses the list once, but the number of elements it needs to visit could be significantly greater than \(k\), depending on how many duplicates have been added. Finally, size needs to check whether or not each element is duplicated before counting it.
What is the time complexity of size if the list has duplicates?
One implementation of size is
fun size(s :: LSet) -> Number:
  cases (List) s:
    | empty => 0
    | link(f, r) =>
      if r.member(f):
        size(r)
      else:
        1 + size(r)
      end
  end
end
Let's now compute the complexity of the body of the function, assuming the number of distinct elements in s is \(k\) but the actual number of elements in s is \(d\), where \(d \geq k\). To compute the time to run size on \(d\) elements, \(T(d)\), we should determine the number of operations in each question and answer. The first question has a constant number of operations, and the first answer also a constant. The second question also has a constant number of operations. Its answer is a conditional, whose first question (r.member(f)) needs to traverse the entire rest of the list, and hence has \(O([d \rightarrow d])\) operations. If it succeeds, we recur on a list of \(d-1\) elements, which takes time \(T(d-1)\); else we do the same but perform a constant more operations. Thus \(T(0)\) is a constant, while the recurrence (in big-Oh terms) is
\begin{equation*}T(d) = d + T(d-1)\end{equation*}
Thus \(T \in O([d \rightarrow d^2])\). Note that this is quadratic in the number of elements in the list, which may be much bigger than the size of the set.
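The quadratic bound can be seen by unrolling the recurrence. In this sketch \(c\) is a symbol introduced only here, standing for the constant \(T(0)\):

```latex
\begin{align*}
T(d) &= d + T(d-1) \\
     &= d + (d-1) + T(d-2) \\
     &= d + (d-1) + \cdots + 1 + T(0) \\
     &= \frac{d(d+1)}{2} + c
\end{align*}
```

The dominant term is \(d^2/2\), which is why the bound is quadratic in \(d\) rather than in \(k\).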
Now that we have two representations with different complexities, it's worth thinking about how to choose between them. To do so, let's build up the following table. The table distinguishes between the interface (the set) and the implementation (the list), because, owing to duplicates in the representation, these two may not be the same. In the table we'll consider just two of the most common operations, insertion and membership checking:
| | With Duplicates | | Without Duplicates | |
| | insert | is-in | insert | is-in |
| Size of Set | constant | linear | linear | linear |
| Size of List | constant | linear | linear | linear |
A naive reading of this would suggest that the representation with duplicates is better because it's sometimes constant and sometimes linear, whereas the version without duplicates is always linear. However, this masks a very important distinction: what "linear" means. When there are no duplicates, the size of the list is the same as the size of the set. However, with duplicates, the size of the list can be arbitrarily larger than that of the set!
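To see how far apart the two sizes can drift, consider inserting the same element over and over into the representation that keeps duplicates. Both helper names below are hypothetical and exist only for this sketch.

```pyret
# Hypothetical helper: insert e into s a total of n times, keeping duplicates.
fun insert-repeatedly(e :: Number, n :: Number, s :: List) -> List:
  if n <= 0: s
  else: insert-repeatedly(e, n - 1, link(e, s))
  end
end

# Hypothetical helper: number of distinct elements, i.e. the size of the set.
fun distinct-count(s :: List) -> Number:
  cases (List) s:
    | empty => 0
    | link(f, r) =>
      if r.member(f): distinct-count(r) else: 1 + distinct-count(r) end
  end
end

check:
  rep = insert-repeatedly(7, 100, empty)
  rep.length() is 100        # size of the list
  distinct-count(rep) is 1   # size of the set
end
```

A membership test for an absent element would have to walk all 100 links here, even though the set contains a single element.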
Based on this, we can draw several lessons:
Which representation we choose is a matter of how much duplication we expect. If there won't be many duplicates, then the version that stores duplicates pays a small extra price in return for some faster operations.
Which representation we choose is also a matter of how often we expect each operation to be performed. The representation without duplication is in the middle: everything is roughly equally expensive (in the worst case). The representation with duplicates is at the extremes: very cheap insertion, potentially very expensive membership. But if we will mostly only insert without checking membership, and especially if we know membership checking will only occur in situations where we're willing to wait, then permitting duplicates may in fact be the smart choice. (When might we ever be in such a situation? Suppose your set represents a backup data structure; then we add lots of data but very rarely, indeed only in case of some catastrophe, ever need to look for things in it.)
Another way to cast these insights is that our form of analysis is too weak. In situations where the complexity depends so heavily on a particular sequence of operations, big-Oh is too loose and we should instead study the complexity of specific sequences of operations. We will address precisely this question later (Halloween Analysis).
Moreover, there is no reason a program should use only one representation. It could well begin with one representation, then switch to another as it better understands its workload. The only thing it would need to do to switch is to convert all existing data between the representations.
How might this play out above? Observe that data conversion is very cheap in one direction: since every list without duplicates is automatically also a list with (potential) duplicates, converting in that direction is trivial (the representation stays unchanged, only its interpretation changes). The other direction is harder: we have to filter duplicates (which takes time quadratic in the number of elements in the list). Thus, a program can make an initial guess about its workload and pick a representation accordingly, but maintain statistics as it runs and, when it finds its assumption is wrong, switch representations, and it can do so as many times as needed.
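The harder direction of the conversion can be sketched as follows. The name to-nodup is hypothetical; the function keeps the last occurrence of each element and, because it calls member once per element, it exhibits the quadratic cost mentioned above.

```pyret
# Hypothetical converter: from a list with (potential) duplicates
# to a list without duplicates.
fun to-nodup(s :: List) -> List:
  cases (List) s:
    | empty => empty
    | link(f, r) =>
      if r.member(f):
        # f appears again later, so drop this occurrence
        to-nodup(r)
      else:
        link(f, to-nodup(r))
      end
  end
end

check:
  to-nodup([list: 3, 2, 1, 1]) is [list: 3, 2, 1]
  to-nodup([list: 1, 2, 1, 3, 2]) is [list: 1, 3, 2]
  to-nodup(empty) is empty
end
```

The opposite conversion needs no code at all: a duplicate-free list can be handed unchanged to the operations that tolerate duplicates.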
15.1.4 Other Operations
Implement the remaining operations catalogued above under each list representation.
Implement the operation
remove :: (Set, T -> Set)
under each list representation. What difference do you see?
Suppose you're asked to extend sets with these operations, as the set analog of first and rest:
one :: (Set -> T)
others :: (Set -> Set)
You should refuse to do so! Do you see why?
With lists the first element is well-defined, whereas sets are defined to have no ordering. Indeed, just to make sure users of your sets don't accidentally assume anything about your implementation (e.g., if you implement one using first, they may notice that one always returns the element most recently added to the list), you really ought to return a random element of the set on each invocation.
Unfortunately, returning a random element means the above interface is unusable. Suppose s is bound to a set containing 1, 2, and 3. Say the first time one(s) is invoked it returns 2, and the second time 1. (This already means one is not a function, an issue we'll get to elsewhere [REF].) The third time it may again return 2. Thus others has to remember which element was returned the last time one was called, and return the set sans that element. Suppose we now invoke one on the result of calling others. That means we might have a situation where one(s) produces the same result as one(others(s)).
Why is it unreasonable for one(s) to produce the same result as one(others(s))?
Suppose you wanted to extend sets with a subset operation that partitioned the set according to some condition. What would its type be? See [REF join lists] for a similar operation.
The types we have written above are not as crisp as they could be. Define a has-no-duplicates predicate, refine the relevant types with it, and check that the functions really do satisfy this criterion.
15.2 Making Sets Grow on Trees
Let's start by noting that it seems better, if at all possible, to avoid storing duplicates. Duplicates are only problematic during insertion due to the need for a membership test. But if we can make membership testing cheap, then we would be better off using it to check for duplicates and storing only one instance of each value (which also saves us space). Thus, let's try to improve the time complexity of membership testing (and, hopefully, of other operations too).
It seems clear that with a (duplicate-free) list representation of a set, we cannot really beat linear time for membership checking. This is because at each step we can eliminate only one element from contention, which in the worst case requires a linear amount of work to examine the whole set. Instead, we need to eliminate many more elements with each comparison, more than just a constant.
In our handy set of recurrences (Solving Recurrences), one stands out: \(T(k) = T(k/2) + c\). It says that if, with a constant amount of work, we can eliminate half the input, we can perform membership checking in logarithmic time. This will be our goal.
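For readers who want to see where the logarithm comes from, the recurrence can be unrolled as below. The sketch assumes, for simplicity, that \(k\) is a power of two; the index \(i\) is introduced only here.

```latex
\begin{align*}
T(k) &= T(k/2) + c \\
     &= T(k/4) + 2c \\
     &= T(k/2^i) + i \cdot c \\
     &= T(1) + c \log_2 k \qquad \text{(taking } i = \log_2 k \text{)}
\end{align*}
```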
Before we proceed, it's worth putting logarithmic growth in perspective. Asymptotically, logarithmic is obviously not as nice as constant. However, logarithmic growth is very pleasant because it grows so slowly. For instance, if an input doubles from size \(k\) to \(2k\), its logarithm, and hence resource usage, grows only by \(\log 2k - \log k = \log 2\), which is a constant. Indeed, for just about all problems, practically speaking the logarithm of the input size is bounded by a constant (that isn't even very large). Therefore, in practice, for many programs, if we can shrink our resource consumption to logarithmic growth, it's probably time to move on and focus on improving some other part of the system.
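A small experiment makes the slow growth tangible. The hypothetical function halvings below counts how many times an input size can be halved before at most one element remains, which is a stand-in for the number of steps a halving algorithm would take.

```pyret
# Hypothetical helper: number of halvings needed to bring k down to at most 1.
fun halvings(k :: Number) -> Number:
  if k <= 1: 0 else: 1 + halvings(k / 2) end
end

check:
  halvings(1024) is 10
  # Doubling the input adds just one more step.
  halvings(2048) is 11
  # Even a million elements need only twenty steps.
  halvings(1000000) is 20
end
```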
15.2.1 Converting Values to Ordered Values
We have actually just made an extremely subtle assumption. When we check one element for membership and eliminate it, we have eliminated only one element. To eliminate more than one element, we need one element to "speak for" several. That is, eliminating that one value needs to have safely eliminated several others as well without their having to be consulted. In particular, then, we can no longer compare for mere equality, which compares one set element against another element; we need a comparison that compares one element against a set of elements.
To do this, we have to convert an arbitrary datum into a datatype that permits such comparison. This is known as hashing. A hash function consumes an arbitrary value and produces a comparable representation of it (its hash), most commonly (but not strictly necessarily) a number. A hash function must naturally be deterministic: a fixed value should always yield the same hash (otherwise, we might conclude that an element in the set is not actually in it, etc.). Particular uses may need additional properties: e.g., below we assume its output is partially ordered.
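As a purely illustrative sketch of these requirements, here is one possible hash for strings: the sum of the string's code points. The names string-hash and sum-all are hypothetical; string-to-code-points is taken from Pyret's string library. The result is a number, so it can be compared with numeric comparison, and the function is deterministic, but note that distinct strings may share a hash.

```pyret
# Hypothetical helper: add up a list of numbers.
fun sum-all(ns :: List<Number>) -> Number:
  cases (List) ns:
    | empty => 0
    | link(f, r) => f + sum-all(r)
  end
end

# Hypothetical hash: the sum of the code points of the string.
fun string-hash(s :: String) -> Number:
  sum-all(string-to-code-points(s))
end

check:
  # Deterministic: the same input always yields the same hash.
  string-hash("ab") is 195
  string-hash("ab") is string-hash("ab")
  # Not unique: different strings can collide.
  string-hash("ab") is string-hash("ba")
end
```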
Let us now consider how one can compute hashes. If the input datatype is a number, it can serve as its own hash. Comparison simply uses numeric comparison (e.g., <).
[…]
fun is-in-bt(e :: Number, s :: BT) -> Boolean:
  cases (BT) s:
    | leaf => false
    | node(v, l, r) =>
      if e == v:
        true
      else:
        is-in-bt(e, l) or is-in-bt(e, r)
      end
  end
end
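The function above refers to a datatype BT whose definition does not appear in this section. The sketch below assumes a plain binary tree of numbers with a leaf variant and a node variant holding a value and two subtrees; that shape is inferred from the cases expression and is an assumption. The function is repeated so that the block runs on its own.

```pyret
# Assumed definition: a binary tree of numbers.
data BT:
  | leaf
  | node(v :: Number, l :: BT, r :: BT)
end

# Membership by searching both subtrees when the root does not match.
fun is-in-bt(e :: Number, s :: BT) -> Boolean:
  cases (BT) s:
    | leaf => false
    | node(v, l, r) =>
      if e == v: true
      else: is-in-bt(e, l) or is-in-bt(e, r)
      end
  end
end

check:
  t = node(5, node(3, leaf, leaf), node(8, leaf, leaf))
  is-in-bt(8, t) is true
  is-in-bt(4, t) is false
  is-in-bt(1, leaf) is false
end
```

Because this version may descend into both subtrees, it can still visit every node in the worst case; it does not yet exploit any ordering of the elements.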