The goal of this entire project is to give you lots of practice with linked data structures and with most of the programming techniques taught in this course: inheritance, recursion, exceptions, unit testing, etc.
The term tests and final exam will contain questions based on the concepts in this project — they are major concepts in the course after all! If you let your partner(s) do too much of the work, you are very likely to do badly on the tests. We want to make sure that everyone understands how to solve this project, individually. If you are feeling lost, don't wait: talk to us now!
Now that you have a basic implementation of the Multiset ADT working, and a (hopefully) thorough suite of test cases, it is time to turn our attention back to the original topic of this project...
There are many different data structures we could use to implement the Multiset ADT: dictionaries, built-in lists, linked lists, etc. In fact, a dictionary would be particularly well-suited to this task. But using a dictionary would not give you a chance to practice working with linked data structures and recursion... so we'll do something a bit more complicated instead! :)
Let's think about the efficiency of linked lists compared to built-in lists. As we know, linked lists have one big advantage over built-in lists (it takes only constant time to insert or remove a value once its position is known) but one big disadvantage (it takes linear time to find a value, in the worst case).
Of course, it also takes linear time to find a value in a built-in list, in the worst case, unless the list is sorted — in that case we can use binary search to find any value in worst-case time O(log n).
What if we kept the elements sorted in a linked list? Unfortunately, this would not allow us to search faster: in order to compare a value with the middle element of the list, we would still need to traverse through every element in the first half of the list, unlike in a built-in list where we can access the middle element directly (in constant time).
But wait, couldn't we simply keep an extra pointer directly to the middle node of the linked list? This would certainly allow us to access the middle element in constant time, but it wouldn't be enough to allow us to search the entire list faster: after making one comparison with the middle element (in constant time), we would need to access one of the two elements roughly 1/4 or 3/4 of the way through the list, then some element 1/8, 3/8, 5/8, or 7/8 of the way through the list, etc. The only way to have constant-time access to each of those elements would be to maintain all of these additional pointers, which seems to lead right back to storing the values directly into a built-in list.
In this part of the project, you will explore one particular approach around this problem, invented by Prof. William Pugh in 1989.
Prof. Pugh's idea was simple: use a sorted linked list in which some of the nodes have extra pointers that "skip" ahead some number of positions in the list. This is done in a layered fashion, so that level 1 (at the bottom) is a simple linked list where each node has a link to the next node; level 2 is a linked list where each node (starting with the first) has a link to the node 2 positions ahead; level 3 is a linked list where each node has a link to the node 4 positions ahead; etc. Conceptually, half of the nodes on level i have links that skip over one node in level i−1. See Figure 1 for an illustration.
As illustrated, it is typical to use a header node to store the list of pointers to the first node in the list at each level, and to use a special sentinel marker for the end of the list, whose value is conceptually "infinite."
With this type of structure, a list storing n elements requires log2 n levels, and searching for a value x becomes an O(log n) operation in the worst-case: start at the top level and follow pointers until the next element at the current level is larger than x, at which point we drop down one level and repeat, until the bottom level is reached.
However, there is one serious drawback to this idea: every time an element is inserted or removed, the levels of all the following nodes have to be adjusted! We're right back to the performance of built-in lists...
Prof. Pugh's solution to this problem was also simple (though more difficult to analyze): replace the rigid restriction on the levels of nodes by a randomized distribution of levels. That is, instead of requiring that every node at level i skip over exactly one node at level i−1, generate node levels randomly so that this property holds only on average. Figure 2 illustrates what a randomized skip list might look like, and we will describe exactly what it means to assign levels "at random" further below.
There are many ways to implement skip lists, particularly by making use of various built-in data structures in Python. For this project, we will require that you use a particular kind of linked data structure — remember the main purpose of the project!
Each level of the list will be stored as a separate singly-linked list, where each node has an additional pointer to the node that stores the same value one level "below". Figure 3 illustrates the basic data structure you must use. Yes, this is a somewhat wasteful data structure: there are a number of "extra" node objects and there is some duplication of information (each element is repeated in multiple nodes). But since the level of the entire structure is expected to be roughly equal to log n (where n is the number of values in the list), the amount of "waste" is not too large so we can live with it. And in any case, you don't have a choice! :)
next" and "
Now that we have discussed how to store information, we explain how to carry out three basic operations on skip lists: searching, inserting, and removing. These operations are not conceptually difficult, but there are many details to keep track of. Take the time to draw lots of pictures representing different situations to really understand what's going on and what you have to do.
To search for an element x in a skip list, we start at the top level
and examine nodes until we find a value greater than or equal to x
(or reach a sentinel node), on the current level. It is important to keep track
of the previous node on the current level, because if we have not found
x, we move down one level below the previous node and repeat this
process. Either we will eventually find x, or we will eventually
"fall off" the bottom level. This process is illustrated in Figure 4,
where we are searching for element
10 and the nodes and links
examined have been highlighted — links are examined
top-to-bottom and left-to-right.
Conceptually, removing an element from a skip list involves searching for that
element, then updating the appropriate "forward" links to bypass the nodes that
store the element. However, unlike a regular search, it is important to carry
out a "full search" that goes all the way down to the bottom level (so that
the element is removed from every level that it belongs to). We've
highlighted the nodes and links that will be examined during this operation in
Figure 5, where we are removing the element
11. The second
picture shows the resulting list, where the highlighted links are the ones that
were modified to remove the node storing
Insertion follows a similar pattern: search for the position of the value to
insert, then update forward links (on every relevant level) to splice in new
nodes to store the element. For example, suppose we insert value
12, with 4 levels, into the last skip list in Figure 5. The
insertion algorithm would examine the nodes and links highlighted in the first
picture in Figure 6 and end up with the second picture in that figure
(where the highlighted links are the ones updated for the insertion).
There is one issue that remains to discuss: how to determine the number of
levels for new values. When a new value is inserted, its level is determined
randomly and new individual nodes are inserted at each level, starting at the
bottom. For each level, we use function
random() from module
random to generate a floating-point number in the range [0,1) and
compare its value with a fixed "probability parameter"
p = 0.5 (a constant for skip lists). The comparison is
considered a success if the random value is strictly less than
p, in which case the node is extended up by one level. This is
repeated until the random number generated is greater than or equal to
p, at which point no more levels are added for the new value. If the
node to be inserted is supposed to contain more levels than the rest of the
list, simply add new levels to the list as needed, in order to accommodate the
Write code for the various "node" classes you will use to implement the skip list data structure described above. Remember to use an appropriate inheritance hierarchy for the various types of nodes. For now, just put in a constructor in each node class: you will likely want to add more methods as you write the rest of the code and discover that recursive helper methods in one or more of your node classes would be useful.
Then, re-implement methods
remove in class
that they correspond to a true randomized skip list data structure as described
above. To help you develop and debug your code, you may find it useful to also
__str__ so that it includes the full linked structure
of the list at every level. For example, the string for the second skip list in
Figure 6 above would look like this:
-> 12 -> -> 7 -> 12 -> 17 -> -> 3 -> 7 -> 12 -> 17 -> -> 3 -> 5 -> 7 -> 9 -> 12 -> 13 -> 15 -> 17 ->
It would be nice if the repeated values in the list lined up nicely (for example, if all of the 12's were directly above one another), but don't worry about this for now: we will fix it later.
Next, re-implement any other method you added to class
SkipList to work with the new structure of your skip lists.
You should not have to make any changes to class
MultiSet in order for everything to continue to work
correctly — if you do, then it is a sign that your
initial design was not modular enough: your two classes depended on each other
too closely. Take this opportunity to refactor your code (meaning that you keep
the functionality the same but rewrite the code to clean it up). While you are
doing this, add a
container parameter to the constructor for class
MultiSet, to specify the initial content of the multiset (with a
default value of
). I know many of you had done this already and
then had to take it out for Part I of the project, and I'm sorry to make
you go back-and-forth on this; on the plus side, making this work requires very
few small changes to the existing code.
The following points give you some hints about how to design your code. You are under no obligation to follow these hints! But if you take the time to think about them and understand them, they should help you write better code.
__lt__/etc. in your node classes. By carefully changing the implementation of these methods for the various types of nodes, and being careful to call the appropriate methods, you can simplify your searching code by removing the need for special cases.
__getitem__ in class
allow access to elements in a skip list by their index. One way to achieve this
would be to simply move along the links on the bottom level of the skip list
and keep a count of the number of elements
encountered — like you did back in Lab 5. But that
would take time O(n) to access the element at index
n... We want to do better!
To do this more efficiently, we store an additional attribute
skip" in each node: the
skip of a node is the
difference between the next node's index and this node's index (we set the
index of head nodes to be -1). This is illustrated in Figure 7 below,
where the skip value for each node is a small number next to the node's
element. We have also indicated the index of each node at the bottom of the
picture, to make it easier for you to figure out the skip value for each
node — these index values are not
stored anywhere in the skip list! — they are included
in the picture only for reference purposes.
Why don't we just store the index of each node directly in the node? Wouldn't that be simpler? The problem with this idea is that when a new element is inserted, or an old element removed, the index of every node that follows would have to be adjusted! By contrast, the skip values for those nodes remain the same: only the skip values for the node inserted (or removed) and the nodes immediately before the node inserted (or removed) need to be changed. If you're not sure you understand this, go back to the earlier pictures above that show new nodes being inserted (or old nodes being removed) and take the time to fill in the skip values for every node in the "before" and the "after" picture: you will see exactly which nodes need to change and how. Also, you should notice an interesting relationship between a node's skip attribute and the skip attributes of the nodes below — this will be useful later.
Once we have skip values stored in each node, finding the node at a certain index is simple: start at the topmost head node, at index -1, and move forward as long as the skip value plus the current index is no bigger than the index we want. If moving forward would take us past the index we want, drop down one level and keep going. For example, if we are looking for the element at index 4 in the skip list pictured in Figure 7, the following steps will be executed — take the time to follow along in the picture to see exactly which nodes and attributes are examined during this process:
7on the second level, and our current index is 2. Since this is still not the index we want, consider the current node's skip value: 2 + 5 = 7 > 4, so drop down one level.
7on the third level, our current index is 2. Considering the current node's skip value, we find that 2 + 2 = 4. Since this is the index we want, move to the next node and return the element we find there (
Once you have this working, change the
__str__ method in class
SkipList so that the values in each node line up properly. You can
make use of the skip value in each node to figure out how many spaces to put in
between successive values. It's OK if your code does not work when the list
contains very "long" values — in other words, you can write
your code to represent each value using a fixed number of characters, so that
it works fine if every value is small enough, but longer values will throw off
the alignment. For example, the list in Figure 7 above might generate a
string like the one below, assuming that every value is represented using three
characters. Once you get this working, it will be much easier to test and debug
your code because you can see the full linked structure of your skip list!
- > 12 - > - > 7 - > 12 - > 17 -> -> 3 - > 7 - > 12 - > 17 -> -> 3 -> 5 -> 7 -> 9 -> 12 -> 13 -> 15 -> 17 ->
One last thing you need to think about: when values are inserted or removed,
the skip values of some existing nodes will need to change, as pointed out
above. Make sure that you make the appropriate changes to your
remove methods to keep the skip values
Consider the code for method
__le__ in the official solution for
for e in self.skiplist: if self.count(e) > other.count(e): return False return True
Let's think about how efficient this code is. The for loop takes linear time to
go through each value in the skip list: we simply have to follow each of the
links at the bottom level of the list. For each iteration of the loop, the body
makes two calls to the
count method, and each call of this method
involves running another for loop over all of the elements in a skip list. So,
overall, the total running time for
quadratic — more precisely,
O(n(n + m)), where
n is the number of elements in
self and m is
the number of elements in
If you think about it, this is very bad! If we were doing this by hand, we could do much better: since both skip lists are sorted, just keep track of our position in each list, at the bottom level, and walk forward one element at a time to check that each element in the first list is also in the second one. This would take time only O(n + m).
However, the way the official solution code is written right now, there is no
way to achieve this: it would be considered very poor design
to access skip list nodes from outside of the
because the nodes used by skip lists are private classes to the
module. And it would still be poor design to make these classes "public",
because it would break modularity: if code outside of the
module is allowed to access nodes, then we are locked into using the node class
as it is and cannot easily change it (like we did when adding the "skip"
So what can we do instead? We will simply add the functionality we want
directly to class
SkipList and then just call the new method from
MultiSet code. In this example, we would implement
__le__ in class
SkipList (being careful to
write it so it takes only linear time) and then the code for class
MultiSet would be simplified to a one-liner:
# Done in class SkipList to improve efficiency from quadratic to linear. return self.skiplist <= other.skiplist
For this last part of the project, you must repeat this analysis for
every method in class
MultiSet. First, add a short
comment to each method to indicate its current worst-case running time (in
big-Oh terms). Next, for each method where you find the original code
inefficient and where there is an obvious way to make use of the structure of
the skip list to speed things up, modify the code to improve its efficiency.
This may involve doing something like the example above (adding a new method to
SkipList and changing the corresponding method in class
MultiSet to simply call the new
SkipList method). But
you may be able to make improvements by simply rewriting the code in class
MultiSet directly, without having to add anything new to class
SkipList. Whatever you do, make sure to add comments to indicate
the changes you have made and the improvement in the running time, similar to
the one shown above.
Your grade will depend partly on how much you have improved the code's big-Oh running time. But of course, you should not sacrifice good code design just for the sake of marginal gains in efficiency! If you are considering changes that improve the running time by no more than some constant factor, make sure those changes are also well-motivated in terms of the code design: your methods should always be easy to read and understand, by keeping them as simple as possible and using comments where appropriate to explain what you are doing.
You may not use built-in lists or dictionaries (or
any other container type from a Python module) to store any of the data in your
implementation of class
SkipList: doing so will result in an
automatic failure for this project!
Your code must be written following the style conventions from PEP8, as described on the
Homework page. In addition to the style conventions that are checked by the
pep8.py script, pay attention to the guidelines that concern
variable, function, and class names, docstrings and comments, design principles
(making good use of helpers to remove duplication), etc.
In particular, each file you submit must start with an
appropriate docstring. Please include the CDF username of each group
member at the end of this docstring, preceded by the word
Authors:". Do this even if you worked alone. For
example, your docstring might look like this:
"""An implementation of the Multiset ADT using a Skip List. ...details go here... Authors: fpitt """
Authors:" line in your docstring takes the place of your
signature, to confirm that you have read and understood the policy on Academic
Offences given on the Course Information Sheet, and that your submission
contains only the work of the students listed. Your submission
cannot be graded if any of your files is
missing this line!
Submit your files under "Project 2" on MarkUs