CMSC 341 Data Structures All Sections Fall 2018 Home Sections Syllabus Schedule Homework Projects Resources FAQ Staff Project 5: Incremental Rehash Due: Tuesday, December 11, 8:59:59pm Objectives The objective of this programming assignment is to have you practice implementing a hash table. Background Your final programming assignment is to implement a hash table for strings. Your implemenation will use open addressing, linear probing and the division method. We even give you the function to compute the hashcode of a string: The problem is that linear probing is vulnerable to primary clustering. So, as with previous projects, you will add some optimizations to the basic design. When a hash table is full or has too many collisions, a last resort is torehash. When you rehash, you construct another hash table and insert every item in the old hash table into the new hash table. The new hash table might be bigger and could use a different hash function. The hope is that you were just unlucky with collisions and that the new hash table will not have as many. Rehashing is very expensive in terms of running time since you have to re-insert every item from the old hash table into the new hash table. Alternatively, you can rehashincrementallyand not move the items all at once. Incremental rehashing will not lower the overall running time of rehashing, but it will prevent your program from stalling for a long time while a hash table is being copied. During an incremental rehash, there are two hash tables in the data structure: the old hash table and the new hash table. When you insert a new string, you insert it in the new hash table. When you search for a string, the string may be in either the old hash table or the new hash table. You have to look in both. Similarly, a remove operation will have to look in both tables and remove the item from the appropriate one. During the incremental rehash phase, every operation (i.e., insert, find and remove) will also move a few items from the old table to the new table. After a number of operations, most of the items will have migrated to the new table. The stragglers in the old table will be moved to the new table and the old table can be discarded. Your Assignment Your assignment is to implement a hash table data structure that uses incremental rehashing as described above. We are assuming the concept of a standard hash table that uses open addressing and linear probing is already familiar to you from class. We are also assuming you understand lazy deletion and why it is necessary when using open addressing. Here, we will cover just the incremental rehashing process. Rehashing is triggered in two ways. First, if the load factor of the table exceeds 0.5 (i.e., more than half of the hash slots are filled), your data structure should begin the incremental rehashing. (There is no ambiguity about the >= 50% vs > 50% question, since the restriction that the table size must be prime and at least 101 implies the size must be odd.) You should check the load factor at the beginning of every insert, find and remove operation. For insert, this makes sense because if you are going to rehash due to a load factor exceeding 0.5, then you want to start inserting in the new hash table right away. For find and remove, checking the load factor first, and triggering the rehash if needed also clarifies what you are supposed to do with the cluster you find in the operation. If you are already in rehash mode, yes, you should move that cluster to the new table. Also, since the load factor is being checked before we do the insert and remove, you dont need to take into account the size change from the insert and remove when checking the load factor. The second way that incremental rehashing is triggered is if linear probing during an insert, find or remove operation requires examining 10 or more hash slots. We usually prefer not to be too specific about implementation details that do not affect the performance of a data structure, but in order to prevent a bunch of questions and also to make grading easier, lets be really specific about measuring the length of a linear probe. Suppose a stringxhashes to index 67: Ifxis inserted in slot 75, the probe length is 9 (since you had to examine slots 67 thru 75) and this does not trigger a rehash. Ifxis inserted in slot 76, then the probe length is 10 and a rehash would be triggered. The current cluster that contains slot 76 should be moved to the new hash table immediately. Ifxis found in slot 75 by a find operation, the probe length is 9 and we do not rehash. Ifxis found in slot 76, then we do start a rehash. The current cluster that contains slot 76 should be moved to the new hash table immediately. The remove operation is the same as the find operation, since we have to look for an item before we can remove it. The slot markedDELETEDis still part of the cluster, there is no item to insert in the new table, but the deleted slot joins the slots before it and the slots after it into one cluster. This cluster should be moved immediately after the deletion, if a rehash is triggered during the operation. In either trigger case, the new hash table should have a size that brings the load factor to 0.25 if all of the elements of the old table were moved to the new table at once. That is, the size of the new table should be four times the number of items in the old table when the incremental hashing is triggered. Next, the new size should then be rounded up to the next prime, of course. Third, the new table size must be different from the old table size; if the number you computed is the same as the current size, increment the size to the next prime. Last, be sure to maintain a minimum table size of 101 and a maximum size of 199,999 for this project. If, during rehashing, you have to create a table with more than 199,999 items, then just give up and throw anout_of_rangeexception. Note that if incremental rehashing was triggered because of a long linear probe, the new table might actually be smaller than the old table.Instead of moving individual items from the old table to the new table, you will move clusters. These are contiguous hash slots that are occupied by items. Clusters are bad for linear probing because they lengthen the running time for a search. (We do not know that an item is not in the hash table until we have reached the end of the cluster.) Also, large clusters tend to get larger because the probability that a new item hashes to a slot in a cluster increases with the size of a cluster. Thus, when incremental rehashing has been triggered, every time a slottin the old table is accessed (during insert, find or remove), we will rehash the cluster in the old hash table surroundingt. For example, the following is a part of the dump from a hash table with size 137. The numbers in the parentheses after each word is the hash index of the word. The words in slots 67 through 87 form a cluster. If we did a search for’aberdeen’, which hashes to slot 68, we wont find the word in the linear probing until we reach slot 80. This will trigger the incremental rehashing since the linear probing had to examine 10 or more slots (13, to be exact) to find’aberdeen’. Thus, every item in the cluster from slot 67 through 87 will be moved to the new table. Note that there are items in the clusterbeforeslot 68 where’aberdeen’hashed to. Also, in the linear probing scheme, when an item is deleted, the slot it occupied is marked as deleted rather than emptied (otherwise linear probing wont work for later searches). So, the cluster also includes deleted slots, but obviously there is no item to move to the new table for that slot. Subsequently, if we do search on’abbreviation’then the cluster in slots 89 through 91 will be moved to the new table. Note that in the example above, if we have not entered incremental rehashing and we did a search for’abel’, this would not trigger an incremental rehash because’abel’hashes to index 76 and we found it in slot 79. The linear probing looked at only 4 slots. That is, incremental rehashing is not triggered by theexistenceof large clusters it is triggered when weencountera long linear probe sequence. After incremental rehashing has started, when do clusters move from the old table to the new table? The answer is: Whenever you touch a string in the old table, the cluster it belongs to should move to the new table. The idea here is that we want to migrate strings from the old table to the new table as much as possible without running down the old table looking for them. (We do that for the final 3%, but thats unavoidable.) So, if you happen to see a string in the old table, you should go ahead and move the cluster that it is part of to the new table. We move clusters and not individual strings, because moving an individual string from the old table would break the linear probe searching (and we do still need to search in the old table). In the example with table size 137 given below, after rehashing has started, suppose someone callsremove(‘trolley’). Since’trolley’is not in the data structure at all, you have to look for’trolley’in both the new table and the old table. With a table of size 137,’trolley’hashes to slot 98. You have to probe slots 98, 99 and 100 of the old table to determine that’trolley’is, in fact, not in the table. Even though the search is unsuccessful, you still did find some strings in the old table, so you should move the cluster consisting of’abdominal’,’abhor’,’abasement’and’abhorrent’in slots 96 through 99 from the old table to the new table. When there is an ongoing rehash, we must also check if the number of items in the old table has dropped below 3% of the total number of items in the data structure. If that is the case, then we should wrap up the rehashing and copy all of the remaining items in the old hash table to the new hash table. The old hash table can then be discarded and we will exit from the rehash mode. This check should be doneat the beginning of everyinsert, find and remove operation. Finally, what should we do if during an incremental rehash, the load factor of the new table exceeds 0.5? or if there is a long linear probe in the new table? Then, we throw up our hands and give up on incremental rehashing. We make a third table that has a table size about 4 times the total number of items in the old and new tables, and move all of the items into the third table at once. Requirements You must design and implement a C++ classHashTablethat uses the incremental rehashing scheme as described above. The following are some firm requirements (i.e., necessary for grading): The name of your class must beHashTable. You must useHashTable.handHashTable.cppfor the filenames (case sensitive) of your header and implementation files. You must use thehashCode()function given above. You must use open addressing and linear probing. You must use the division method where hash index is the value returned byhashCode()modulo the table size. Your table sizes must be prime numbers. If either the user-requested table size or the computed rehashing table size is greater than 199,999, you should You may not use STL classes, not evenstring. You must dynamically allocate arrays of typechar *to hold your C strings. I.e., the declaration of your table should be something like: Your program must not leak memory. We will be hashing strings. The strings are given in a global arraywords. None of the member functions in your implementation of theHashTableclass should use this global array. The driver programs should be the only code that uses thewordsarray. The global variablenumWordshas the size of thewordsarray. Download:words.h. Note that the items in thewordsarray areconst char *strings and not C++ strings. When you store a string in your hash table you should make a copy of the given string. We are providing a function calledint roundUpPrime(int n)that helps you compute good sizes for your hash tables. It takes as a parameter the desired size (either from the user, or the new size computed for rehashing) and rounds it up to the next prime. If the parameter is already prime, it is not rounded up. If the parameter is too large (> 199,999), the function returns 0. Download:primes.cpp. Additional Specifications Here are some additional specifications for theHashTableclass member functions that you have to implement. You will definitely want to add other data members and member functions to theHashTableclass, but the prototypes for the member functions listed here should not be changed. You may create additional classes if you prefer, but all class declarations should be placed inHashTable.hand all implementations should be placed inHashTable.cpp. This is the default constructor for theHashTableclass. The parameternis the requested size of the hash table. Note that it has a default value of 101: that is the minimum allowed size. If the user requests a size smaller than 101, you will override the request and set the size to 101 (i.e., the minimum table size is 101). The maximum allowed size is 199,999, so if the requested size exceeds this, then just give up and throw anout_of_rangeexception. Lastly, the table size must be a prime number, so you must round up the the requested size to the nearest prime number greater than or equal ton. (We provide a function calledint roundUpPrime(int n)that computes this value for you.) TheHashTableobject created by the constructor should be ready for insertion, search and deletion without any additional initialization. This is the destructor. Make sure you deallocate all memory for this object. The strings in the hash table must be deallocated usingfree()since they are C strings (i.e., dont useremovefor C strings). This is the copy constructor for theHashTableclass. If theotherhash table is not in incremental rehash, then copying the hash table is very straightforward. Just make sure you allocate memory for the new hash table and usestrdup()to copy the strings. On the other hand, if theotherhash table has an ongoing rehash, then it doesnt make sense to make a duplicate of a hash table that is in the middle of being copied. Instead, force theothertable to finish its rehashing and copy over the resulting single table. Note that the usualconstmodifier has been removed from the parameter of this copy constructor, since theHashTableobject in the parameter might be asked to finish incremental rehashing. (There are other ways to resolve this, but they are too cumbersome.) Do not call the assignment operator from the copy constructor. If you do not want to have duplicate code, then create a third function that handles the common parts of the both functions. This is the overloaded assignment operator for theHashTableclass. As with the copy constructor, the process is fairly standard if therhsdoes not have an ongoing incremental rehash. If there is an ongoing rehash, then force therhsto finish its rehashing and copy over the resulting single hash table. Note that the usualconstmodifier has been removed from the parameter of this assignment operator, since theHashTableobject in the parameter might be asked to finish incremental rehashing. This function inserts a copy of the C stringstrinto the hash table. It has no return value. (Note: usestrdup()to copy C strings.) Theinsert()function should insert in the new table if there is an ongoing incremental rehash. Callinginsert()with a string that is already in the hash table should have no effect. (I.e., do not insert a second copy of the same value.) Make sure you dont have a copy of a string that you didnt insert floating around. Thats a memory leak. Theinsert()function should trigger incremental rehashing when appropriate as described above. Theinsert()operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%. Thefind()function looks forstrin the hash table. The function returnstrueif found,falseotherwise. Thefind()function look in both the old and the new hash tables if there is an ongoing incremental rehashing. Thefind()function should trigger incremental rehashing when appropriate as described above. Thefind()operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%. Theremove()function removesstrfrom the hash table and returns the pointer. Ifstris not in the hash table,remove()returns NULL. It is the responsibility of the code that callsremove()to deallocate the string that is returned. (Again, usefree(), notdeleteto deallocate.) Recall that for open addressing, when an item is removed, the slot it occupied should be specially marked as deleted, and not set back toNULL, which would break linear probing. But what special value can you use? You need a pointer value that cannot point to any real C-string. Use the following definition inside yourHashTableclass: Then in the .cpp file, initialize with: The constantDELETEDcan then be stored in your hash table slot. DELETED is a safe sentinel value because no string in the program can have that addressit is occupied by the class variable DELETED itself (in the global data section). Theremove()function should trigger incremental rehashing when appropriate as described above. Theremove()operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%. These functions are used for grading purposes, so we can examine your hash table(s). TheisRehashing()function returnstrueif there is an ongoing incremental rehash. ThetableSize()function returns the size of the hash table. When there is an ongoing rehash,tableSize(0)should return the size of the old table andtableSize(1)should return the size of the new table. Similarly,size()returns the number of items currently in the table(s). Theat()function returns a pointer to the string stored at theindexslot of the hash table specified bytable. If theindexis invalid (i.e., less than 0 or greater than or equal to table size), thenat()should throw anout_of_rangeexception (already defined instdexcept). The pointer returned byat()has typeconst char *to prevent the string stored in the hash table from being changed. The calling function can make a copy if desired. Dump should print some vital statistics and the contents of the hash table(s) tostdout. You should include the table size and number of items in the hash table(s). When you print out the string in each hash slot, include the items hash index in parentheses (see example above.) Implementation Notes Remember to mod out by the table size when you are working with hash table indices. Again, in hash tables, the indices wrap around to 0 at the bottom of the hash table. You must take this into account when you use linear probing for insert, find and remove. One more time, the indices wrap. This means clusters can straddle the end of the hash table. When you move a cluster that is at the beginning or at the end of the table, you have to check if the cluster wraps around. Also, note thatforloops do not work very well in this situation, becausedoesnt do the right thing when the cluster wraps around. If you are thinking of working with negative indices for your hash table, remember that the modulo operator in C/C++ does not work mathematically (i.e., correctly):-2 % 10evaluates to -2, not 8. So, the negative indices will not wrap around to the end of the hash table as you might expect. You are working with C null-terminated strings, not C++ strings. If you havent used C strings in a while, please review. For example, a C string is just an array ofchar, so the type ischar *. A dynamically allocated array of C strings has typechar * *since it is a pointer to an array of pointers tochar. Here areMr. Lupolis notes on C strings. You must use thestrcmp()function to make string comparisons. You cannot use the==operator. That will do pointer comparison, not string comparison. To usestrcmp(), make sure youor Your hash table should make a copy of the string inserted, and not just store the pointer. Copies should be made using thestrdup()function. Thestrdup()function allocates memory usingmalloc()instead ofnew. To deallocate this memory, you must usefree()instead ofdelete. While it is possible to allocate an array ofcharusingnewlike this:it would be very confusing for one program to sometimes usemalloc()and sometimes usenewto allocate strings. So, just stick to usingmalloc()andfree()to allocate and deallocate C strings. Having said that, you dont really need to usemalloc()at all, if you usestrdup()to copy strings, sincestrdup()allocates memory. Finally, for all other memory allocations not involving C strings, just usenew. Some of the strings that you are working with have typeconst char *. This means the string being pointed to cannot be changed. (The string is immutable in Python-speak.) Make sure that you know what this means. If you assign aconst char *pointer to achar *pointer, the compiler will give you an error. You should avoid copying strings whenever possible. For example, when you move a string from the old hash table to the new hash table during a rehash, you do not need to make a copy of the string. This means you probably want to have two versions of theinsert()function. The standard one makes copies (because thats what your client is expecting). The internal one that you use for moving should not make copies. (Yes, one can call the other.) You should periodically run your program undervalgrindduring development. This is so you can catch memory leaks as soon as possible. Also, ifvalgrindcomplains about memory read or memory write errors, this meansyou have a bug in your program. You should not ignore these errors. You should fix the bug as soon as possible because memory errors tend to manifest themselves in other places and are difficult to debug. You want to know if your program has any memory errors as soon as possible, because it is likely that the bug is in the last few modifications you have made in your program. Some common errors to watch out for: forgetting to update the number of items after inserting or removing a string taking modulo of number of items instead of table capacity using the capacity of the wrong table trying tostrdup(DELETED) trying to useDELETEDas a string forgetting to intialize the entries of a new table toNULL Test Programs The following programs should be compiled usingwheretestX.cppis one oftest1.cpp,test2.cpp,test3.cpp,test4.cpportest5.cpp. Run these programs undervalgrindto check for memory leaks and memory read/write errors. Basic test ofinsert(),find()andremove()without triggering incremental rehashing. Code (test1.cpp) and sample output (test1.txt). Rehashing triggered. Clusters moved from old hash table to new hash table. Rehashing ends and only one table remains. Code (test2.cpp) and sample output (test2.txt). Rehashing triggered. Further insertions cause long probe sequence in new table. Rehashing stopped and all items are consolidated in one hash table. Code (test3.cpp) and sample output (test3.txt). Robust test with thousands of calls toinsert(),find()andremove(). Resulting hash table is checked against equivalent STLset. Code (test4.cpp) and sample output (test4.txt). Test of copy constructor and assignment operator. Code (test5.cpp) and sample output (test5.txt). What to Submit You must submit the following files: HashTable.h HashTable.cpp Driver.cpp The main function inDriver.cppshould exercise yourHashTablefunctions and show what your program has implemented successfully. (I.e., if your code inDriver.cppdoes not produce any output or it seg faults, then we will assume that you have not implemented very much.) CMSC 341 Fall 2018 CSEE|UMBC
Pssst…We can write an original essay just for you.
Any essay type. Any subject. We will even overcome a 6 hour deadline.
<< SAVE15 >>
Place your first order with code to get 15% discount right away!