CSC 148 H1, Winter 2013

Project, Part II (Final Version)

Due: before 12:00 (noon) on Friday 5 April 2013

Purpose

The goal of this entire project is to give you lots of practice with linked data structures and with most of the programming techniques taught in this course: inheritance, recursion, exceptions, unit testing, etc.

Warning

The term tests and final exam will contain questions based on the concepts in this project — they are major concepts in the course after all! If you let your partner(s) do too much of the work, you are very likely to do badly on the tests. We want to make sure that everyone understands how to solve this project, individually. If you are feeling lost, don't wait: talk to us now!


Skip Lists

Overview

Now that you have a basic implementation of the Multiset ADT working, and a (hopefully) thorough suite of test cases, it is time to turn our attention back to the original topic of this project...

There are many different data structures we could use to implement the Multiset ADT: dictionaries, built-in lists, linked lists, etc. In fact, a dictionary would be particularly well-suited to this task. But using a dictionary would not give you a chance to practice working with linked data structures and recursion... so we'll do something a bit more complicated instead! :)

Let's think about the efficiency of linked lists compared to built-in lists. As we know, linked lists have one big advantage over built-in lists (it takes only constant time to insert or remove a value once its position is known) but one big disadvantage (it takes linear time to find a value, in the worst case).

Of course, it also takes linear time to find a value in a built-in list, in the worst case, unless the list is sorted — in that case we can use binary search to find any value in worst-case time O(log n).

What if we kept the elements sorted in a linked list? Unfortunately, this would not allow us to search faster: in order to compare a value with the middle element of the list, we would still need to traverse through every element in the first half of the list, unlike in a built-in list where we can access the middle element directly (in constant time).

But wait, couldn't we simply keep an extra pointer directly to the middle node of the linked list? This would certainly allow us to access the middle element in constant time, but it wouldn't be enough to allow us to search the entire list faster: after making one comparison with the middle element (in constant time), we would need to access one of the two elements roughly 1/4 or 3/4 of the way through the list, then some element 1/8, 3/8, 5/8, or 7/8 of the way through the list, etc. The only way to have constant-time access to each of those elements would be to maintain all of these additional pointers, which seems to lead right back to storing the values directly into a built-in list.

In this part of the project, you will explore one particular approach around this problem, invented by Prof. William Pugh in 1989.

Concept

Prof. Pugh's idea was simple: use a sorted linked list in which some of the nodes have extra pointers that "skip" ahead some number of positions in the list. This is done in a layered fashion, so that level 1 (at the bottom) is a simple linked list where each node has a link to the next node; level 2 is a linked list where each node (starting with the first) has a link to the node 2 positions ahead; level 3 is a linked list where each node has a link to the node 4 positions ahead; etc. Conceptually, half of the nodes on level i have links that skip over one node in level i−1. See Figure 1 for an illustration.

Picture of an 'ideal' skip list.

Figure 1: An "ideal" skip list.

As illustrated, it is typical to use a header node to store the list of pointers to the first node in the list at each level, and to use a special sentinel marker for the end of the list, whose value is conceptually "infinite."

With this type of structure, a list storing n elements requires log2 n levels, and searching for a value x becomes an O(log n) operation in the worst-case: start at the top level and follow pointers until the next element at the current level is larger than x, at which point we drop down one level and repeat, until the bottom level is reached.

However, there is one serious drawback to this idea: every time an element is inserted or removed, the levels of all the following nodes have to be adjusted! We're right back to the performance of built-in lists...

Prof. Pugh's solution to this problem was also simple (though more difficult to analyze): replace the rigid restriction on the levels of nodes by a randomized distribution of levels. That is, instead of requiring that every node at level i skip over exactly one node at level i−1, generate node levels randomly so that this property holds only on average. Figure 2 illustrates what a randomized skip list might look like, and we will describe exactly what it means to assign levels "at random" further below.

Picture of a typical randomized skip list.

Figure 2: A typical skip list with randomized levels.

Implementation issues

There are many ways to implement skip lists, particularly by making use of various built-in data structures in Python. For this project, we will require that you use a particular kind of linked data structure — remember the main purpose of the project!

Now that we have discussed how to store information, we explain how to carry out three basic operations on skip lists: searching, inserting, and removing. These operations are not conceptually difficult, but there are many details to keep track of. Take the time to draw lots of pictures representing different situations to really understand what's going on and what you have to do.

Search

To search for an element x in a skip list, we start at the top level and examine nodes until we find a value greater than or equal to x (or reach a sentinel node), on the current level. It is important to keep track of the previous node on the current level, because if we have not found x, we move down one level below the previous node and repeat this process. Either we will eventually find x, or we will eventually "fall off" the bottom level. This process is illustrated in Figure 4, where we are searching for element 10 and the nodes and links examined have been highlighted — links are examined top-to-bottom and left-to-right.

Picture of the search process for skip lists.

Figure 4: The nodes and links examined during search (from top-to-bottom, and left-to-right on each level).

Removal

Conceptually, removing an element from a skip list involves searching for that element, then updating the appropriate "forward" links to bypass the nodes that store the element. However, unlike a regular search, it is important to carry out a "full search" that goes all the way down to the bottom level (so that the element is removed from every level that it belongs to). We've highlighted the nodes and links that will be examined during this operation in Figure 5, where we are removing the element 11. The second picture shows the resulting list, where the highlighted links are the ones that were modified to remove the node storing 11.

Picture of the removal process for skip lists.

Figure 5: The nodes and links examined during removal (from top-to-bottom, and left-to-right on each level), and the resulting list.

Insertion

Insertion follows a similar pattern: search for the position of the value to insert, then update forward links (on every relevant level) to splice in new nodes to store the element. For example, suppose we insert value 12, with 4 levels, into the last skip list in Figure 5. The insertion algorithm would examine the nodes and links highlighted in the first picture in Figure 6 and end up with the second picture in that figure (where the highlighted links are the ones updated for the insertion).

Picture of the insertion process for skip lists.

Figure 6: The nodes and links examined during insertion (from top-to-bottom, and left-to-right on each level), and the resulting list.

Determining Levels

There is one issue that remains to discuss: how to determine the number of levels for new values. When a new value is inserted, its level is determined randomly and new individual nodes are inserted at each level, starting at the bottom. For each level, we use function random() from module random to generate a floating-point number in the range [0,1) and compare its value with a fixed "probability parameter" p = 0.5 (a constant for skip lists). The comparison is considered a success if the random value is strictly less than p, in which case the node is extended up by one level. This is repeated until the random number generated is greater than or equal to p, at which point no more levels are added for the new value. If the node to be inserted is supposed to contain more levels than the rest of the list, simply add new levels to the list as needed, in order to accommodate the new node.

Your First Task

Write code for the various "node" classes you will use to implement the skip list data structure described above. Remember to use an appropriate inheritance hierarchy for the various types of nodes. For now, just put in a constructor in each node class: you will likely want to add more methods as you write the rest of the code and discover that recursive helper methods in one or more of your node classes would be useful.

Then, re-implement methods __contains__, insert, and remove in class SkipList so that they correspond to a true randomized skip list data structure as described above. To help you develop and debug your code, you may find it useful to also write method __str__ so that it includes the full linked structure of the list at every level. For example, the string for the second skip list in Figure 6 above would look like this:

 -> 12 -> 
 -> 7 -> 12 -> 17 -> 
 -> 3 -> 7 -> 12 -> 17 -> 
 -> 3 -> 5 -> 7 -> 9 -> 12 -> 13 -> 15 -> 17 -> 

It would be nice if the repeated values in the list lined up nicely (for example, if all of the 12's were directly above one another), but don't worry about this for now: we will fix it later.

Next, re-implement any other method you added to class SkipList to work with the new structure of your skip lists. You should not have to make any changes to class MultiSet in order for everything to continue to work correctly — if you do, then it is a sign that your initial design was not modular enough: your two classes depended on each other too closely. Take this opportunity to refactor your code (meaning that you keep the functionality the same but rewrite the code to clean it up). While you are doing this, add a container parameter to the constructor for class MultiSet, to specify the initial content of the multiset (with a default value of []). I know many of you had done this already and then had to take it out for Part I of the project, and I'm sorry to make you go back-and-forth on this; on the plus side, making this work requires very few small changes to the existing code.

Discussion and advice

The following points give you some hints about how to design your code. You are under no obligation to follow these hints! But if you take the time to think about them and understand them, they should help you write better code.


Extending Skip Lists

Implement method __getitem__ in class SkipList, to allow access to elements in a skip list by their index. One way to achieve this would be to simply move along the links on the bottom level of the skip list and keep a count of the number of elements encountered — like you did back in Lab 5. But that would take time O(n) to access the element at index n... We want to do better!

To do this more efficiently, we store an additional attribute "skip" in each node: the skip of a node is the difference between the next node's index and this node's index (we set the index of head nodes to be -1). This is illustrated in Figure 7 below, where the skip value for each node is a small number next to the node's element. We have also indicated the index of each node at the bottom of the picture, to make it easier for you to figure out the skip value for each node — these index values are not stored anywhere in the skip list! — they are included in the picture only for reference purposes.

Picture of a skip list with indices and new attribute.

Figure 7: Skip list with "skip" attribute values and indices.

Why don't we just store the index of each node directly in the node? Wouldn't that be simpler? The problem with this idea is that when a new element is inserted, or an old element removed, the index of every node that follows would have to be adjusted! By contrast, the skip values for those nodes remain the same: only the skip values for the node inserted (or removed) and the nodes immediately before the node inserted (or removed) need to be changed. If you're not sure you understand this, go back to the earlier pictures above that show new nodes being inserted (or old nodes being removed) and take the time to fill in the skip values for every node in the "before" and the "after" picture: you will see exactly which nodes need to change and how. Also, you should notice an interesting relationship between a node's skip attribute and the skip attributes of the nodes below — this will be useful later.

Once we have skip values stored in each node, finding the node at a certain index is simple: start at the topmost head node, at index -1, and move forward as long as the skip value plus the current index is no bigger than the index we want. If moving forward would take us past the index we want, drop down one level and keep going. For example, if we are looking for the element at index 4 in the skip list pictured in Figure 7, the following steps will be executed — take the time to follow along in the picture to see exactly which nodes and attributes are examined during this process:

  1. Start with the topmost head node at index -1. This is not the index we want.
  2. Examine this node's skip value. Adding it to the current index yields -1 + 9 = 8. And 8 > 4, so moving forward on this level takes us too far.
  3. Drop down one level (we are now on the head node on the second level from the top). The current index is still -1, so we examine this node's skip value. Adding it to the current index yields -1 + 3 = 2. Since 2 ≤ 4, we move forward.
  4. We are now on the node with value 7 on the second level, and our current index is 2. Since this is still not the index we want, consider the current node's skip value: 2 + 5 = 7 > 4, so drop down one level.
  5. We are now on the node with value 7 on the third level, our current index is 2. Considering the current node's skip value, we find that 2 + 2 = 4. Since this is the index we want, move to the next node and return the element we find there (11).

Once you have this working, change the __str__ method in class SkipList so that the values in each node line up properly. You can make use of the skip value in each node to figure out how many spaces to put in between successive values. It's OK if your code does not work when the list contains very "long" values — in other words, you can write your code to represent each value using a fixed number of characters, so that it works fine if every value is small enough, but longer values will throw off the alignment. For example, the list in Figure 7 above might generate a string like the one below, assuming that every value is represented using three characters. Once you get this working, it will be much easier to test and debug your code because you can see the full linked structure of your skip list!

 -                            >  12 -                     > 
 -              >   7 -       >  12 -              >  17 -> 
 ->   3 -       >   7 -       >  12 -              >  17 -> 
 ->   3 ->   5 ->   7 ->   9 ->  12 ->  13 ->  15 ->  17 -> 

One last thing you need to think about: when values are inserted or removed, the skip values of some existing nodes will need to change, as pointed out above. Make sure that you make the appropriate changes to your insert and remove methods to keep the skip values updated correctly!


Making MultiSets more efficient

Consider the code for method __le__ in the official solution for class MultiSet:

    for e in self.skiplist:
        if self.count(e) > other.count(e):
            return False
    return True

Let's think about how efficient this code is. The for loop takes linear time to go through each value in the skip list: we simply have to follow each of the links at the bottom level of the list. For each iteration of the loop, the body makes two calls to the count method, and each call of this method involves running another for loop over all of the elements in a skip list. So, overall, the total running time for __le__ is quadratic — more precisely, O(n(n + m)), where n is the number of elements in self and m is the number of elements in other.

If you think about it, this is very bad! If we were doing this by hand, we could do much better: since both skip lists are sorted, just keep track of our position in each list, at the bottom level, and walk forward one element at a time to check that each element in the first list is also in the second one. This would take time only O(n + m).

However, the way the official solution code is written right now, there is no way to achieve this: it would be considered very poor design to access skip list nodes from outside of the skiplist module, because the nodes used by skip lists are private classes to the module. And it would still be poor design to make these classes "public", because it would break modularity: if code outside of the skiplist module is allowed to access nodes, then we are locked into using the node class as it is and cannot easily change it (like we did when adding the "skip" values).

So what can we do instead? We will simply add the functionality we want directly to class SkipList and then just call the new method from inside the MultiSet code. In this example, we would implement method __le__ in class SkipList (being careful to write it so it takes only linear time) and then the code for class MultiSet would be simplified to a one-liner:

    # Done in class SkipList to improve efficiency from quadratic to linear.
    return self.skiplist <= other.skiplist

For this last part of the project, you must repeat this analysis for every method in class MultiSet. First, add a short comment to each method to indicate its current worst-case running time (in big-Oh terms). Next, for each method where you find the original code inefficient and where there is an obvious way to make use of the structure of the skip list to speed things up, modify the code to improve its efficiency. This may involve doing something like the example above (adding a new method to class SkipList and changing the corresponding method in class MultiSet to simply call the new SkipList method). But you may be able to make improvements by simply rewriting the code in class MultiSet directly, without having to add anything new to class SkipList. Whatever you do, make sure to add comments to indicate the changes you have made and the improvement in the running time, similar to the one shown above.

Your grade will depend partly on how much you have improved the code's big-Oh running time. But of course, you should not sacrifice good code design just for the sake of marginal gains in efficiency! If you are considering changes that improve the running time by no more than some constant factor, make sure those changes are also well-motivated in terms of the code design: your methods should always be easy to read and understand, by keeping them as simple as possible and using comments where appropriate to explain what you are doing.


Requirements and coding standards

You may not use built-in lists or dictionaries (or any other container type from a Python module) to store any of the data in your implementation of class SkipList: doing so will result in an automatic failure for this project!

Your code must be written following the style conventions from PEP8, as described on the Homework page. In addition to the style conventions that are checked by the pep8.py script, pay attention to the guidelines that concern variable, function, and class names, docstrings and comments, design principles (making good use of helpers to remove duplication), etc.

In particular, each file you submit must start with an appropriate docstring. Please include the CDF username of each group member at the end of this docstring, preceded by the word "Authors:". Do this even if you worked alone. For example, your docstring might look like this:

    """An implementation of the Multiset ADT using a Skip List.
    ...details go here...

    Authors: fpitt
    """

The "Authors:" line in your docstring takes the place of your signature, to confirm that you have read and understood the policy on Academic Offences given on the Course Information Sheet, and that your submission contains only the work of the students listed. Your submission cannot be graded if any of your files is missing this line!

Submit your files under "Project 2" on MarkUs